Analytics on apples and oranges: Comparing sports

I was an exercise logger long before I had heard the word “analytics.” I became a semi-serious recreational runner in 1996, and after a few months of paper logs, I have kept track of my running digitally ever since. Early on, I used one of the first log apps, The Athlete’s Diary (which still appears to exist!), then for many years a modified Excel template created by a couple of people from the Dead Runners Society. Since switching to GPS watches, I’ve used a bunch of tools. Currently I use and like Garmin Connect and Strava, as I wrote last week.

The analytics these new tools provide are impressive. But running is, in lots of ways, easy to be analytical about: There’s time and distance, and therefore speed. There’s effort, for which heart rate is a pretty great proxy. Add terrain (from GPS or your own logging) and cadence (from your watch’s accelerometer), and you can analyze and compare almost everything you do.

But what if you do different sports? Or, worse yet, sports that are very complex? When “constantly varied” is in the definition of a sport, as it is for CrossFit, my preferred version of “functional fitness,” you know it’s not going to be easy to figure out how one workout relates to another. What’s a boy to do?

The Missing Chili Pepper, Part 2

Back in June, I reported on the observation David Cottrell and I made that on RateMyProfessors.com (RMP), having a chili pepper makes a difference. The chili pepper, recall, is RMP’s way of indicating that at least a third of your raters think you are “hot,” whatever that means.

Here’s an updated version of that earlier result:

The relationship between easiness and professor quality conditional on hotness.

What this updated visual tells us better than our earlier one is that the effect is not linear. For instructors not rated “hot,” quality increases with easiness significantly more steeply than it does for “hot” professors, especially at the hardest end.

I wondered at the end of the earlier post whether we might see a difference conditional on instructor “hotness” in the official University of Michigan evaluations, which of course don’t ask anything about hotness. We now have the answer: weeeell… sort of, with grains of salt and lots of qualifications.

Here’s what we did. We took our dataset of all UM evaluations for the College of Engineering and the College of Letters, Sciences, and the Arts from Fall 2008 to Winter 2013. That’s about 10,000 instructors, including professors, lecturers, and graduate student instructors. Our RMP data has 3,100 instructors, again of all varieties. We were only able to match 715 instructors in these two sets, largely because instructor names are in different formats — and rely on student spelling skills on RMP. (I admit I have a particularly hard name, and I’ve yet to see all my students get it right, but this does seem like a common problem.)
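The matching problem described above, differing name formats plus student misspellings, can be sketched with the Python standard library’s fuzzy matching. This is an illustrative sketch, not the pipeline David and I actually used; the function names, the sample names, and the 0.85 similarity cutoff are all my own choices:

```python
import difflib
import re

def normalize(name):
    """Lowercase, strip punctuation, and sort the name parts so that
    'Cottrell, David' and 'david cottrell' compare equal."""
    parts = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(parts))

def match_instructors(official, rmp, cutoff=0.85):
    """Pair official-roster names with the closest RMP name whose
    normalized form clears the similarity cutoff."""
    rmp_norm = {normalize(n): n for n in rmp}
    matches = {}
    for name in official:
        hits = difflib.get_close_matches(
            normalize(name), list(rmp_norm), n=1, cutoff=cutoff
        )
        if hits:
            matches[name] = rmp_norm[hits[0]]
    return matches

# A roster-format name and an RMP-format name, including a misspelling.
roster = ["Cottrell, David", "Korhonen, Ilkka"]  # "Korhonen" is a made-up example
rmp_names = ["david cottrell", "Ilka Korhonen"]
print(match_instructors(roster, rmp_names))
```

Even with normalization, a fuzzy cutoff is a trade-off: set it too low and distinct instructors with similar names get merged, too high and misspelled ones are lost, which is part of why our matched set is so much smaller than either source.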

So, with 715 observations, there’s not much we can say, and the data are not conclusive. Here’s the best thing to show:

The relationship between instructor quality on RMP and “excellence” on UM evaluations.

On the x axis is an instructor’s quality rating on RMP; on the y axis is his or her median response to the statement “Overall, this is an excellent instructor” (with 0 as “strongly disagree” and 5 as “strongly agree”). The red circles represent instructors who have chili peppers on RMP, the black ones those who don’t. The dataset includes only instructors who appear on RMP.

There is a small positive correlation between RMP and Michigan’s own evaluations, which is good news for RMP (and to be expected). “Hot” instructors cluster at the high ends of both scales. But there are also plenty of “not hot” instructors with high ratings.

This is what we feel comfortable concluding: if you tell me that you have a chili pepper on RMP, I can tell you it’s more likely than not that you are highly rated on both RMP and in the official evals. The opposite is not true: if you say you don’t have a chili pepper, I can’t tell you anything about your other ratings. And, of course, most University of Michigan instructors are not on RMP at all.

Still, seeing the “chili pepper” difference in our data takes us back to the question of what it might be measuring. I won’t repeat the speculations of the earlier post, but offer a few more. First, maybe it is about looks, after all, as Daniel Hamermesh and Amy Parker’s fantastically titled 2005 paper, “Beauty in the Classroom: Instructors’ Pulchritude and Putative Pedagogical Productivity,” suggests. Looks make a difference for professionals’ earnings, so why not for instructors’ ratings? Another, less depressing and creepy possibility is that the chili pepper is measuring what psychologist Joseph Lowman has called instructors’ “interpersonal rapport”: positive attitude toward students, democratic leadership style, and predictability.

Of course, those two don’t have to be mutually exclusive: for a few students, the chili pepper may just be a report on how attractive they perceive the instructor to be, while for others, as our anecdotal evidence suggests, it may be a measure of positive rapport. Either way, it’s too bad that RMP has to frame the issue like a horny eighteen-year-old frat guy.

Working hard, and harder. Maybe. How would we know?

This post is inspired by two things. First, according to a recent study by Magna Publications, half of academics report that they work harder than they did five years ago. I have no bone to pick with Magna, which produces the very helpful Faculty Focus articles on teaching in higher education. And their survey approach is not unusual: I have filled out a few surveys of the kind they seem to have used, run by perfectly competent and reasonable people. In fact, one could argue it’s thanks to surveys like theirs that we’ve made a lot of progress in finding systematic discrepancies in faculty lives, among genders and races, for example. But asking people whether they “think” or “feel” they are working harder than five years ago is to invite all sorts of problems. It’s well known that we are pretty bad at evaluating the past (or the future) reliably.

The second inspiration comes from a blog post by Stephen Wolfram, the creator of Mathematica and a precursor of the “quantified self” movement. Wolfram had tracked his email traffic since 1989 and decided to do a bunch of analyses on his work patterns, overall busyness, and the like.

For a modern white-collar worker, not just Wolfram, email is likely a reasonable proxy for how busy he or she is. I’m sure there is variance between professions (how effing backward can physicians really be on modern communication tools?) and individuals (I am pretty anal about replying to all email sent directly to me, usually within a few hours). For academics, I would argue, the email we send is a good indicator of work, especially of the “how busy am I?” dimension. (Nobody ever says, “I was so friggin’ busy today: I got to work on my article all day without any interruptions.”)

Wolfram’s analyses were cool (to me, at least) and possibly creepy (to many of you), but they inspired me to try something similar. I have used email since 1987, but I’ve only tracked it regularly since 1998. Unfortunately, due to data incompatibilities (and an unfortunate detour to Microsoft products), I have a solid record only since 2004. Even that record has some problems, but before I discuss them, let’s look at the data:

Sent mail 2004-2013

Monthly sent mail totals.

Here’s how many emails I sent each month. The total is 24,861 messages between late 2004 and last week. The highest month has 641 messages. You’ll notice a couple of things right away:

  • There are gaps in 2007. That’s just missing data; my record-keeping system has been imperfect. (That’s when I got tenure, but I did not stop using email. I was probably just more careless about my quarterly archiving.)
  • 2011 seems awfully light. What happened? What happened was that I messed up a setting on my laptop, so that the mail I sent from that computer didn’t get tracked. For 2011, all you see are the messages I sent from my office computer.

So that little technical snafu makes it kind of pointless to do descriptive statistics such as daily or monthly averages and standard deviations. (I have them if you want, though.) But you can eyeball the data: summers are obviously lighter, and my sabbatical, academic year 2008-2009, is also lighter, but not dead. There is a slight upward trend, but the most noticeable things are the whopper months, especially after tenure. So if I had to fill out a survey that asks, “Are you busier now than you were five years ago?”, the crunch of February through April (I was grad admissions director, so it was busy) might loom large, and I might say, “Yes!” even though it’s not that much worse.

See, if I look at my daily email habits, following Wolfram, I see pretty much the same pattern: I send emails during all of my waking hours:

Distribution of sent email, 2004-2013

Daily email sending practices, by time of day.

On the x axis here are individual days, on the y axis is the time of day when the message was sent. (Each dot is a message.) I have not pulled an all-nighter since my first year of grad school; the seeming night-time messages are from travels in different time zones. Now you see that, indeed, around 2011 I only kept messages sent from my office machine. I work through the day, but I only go to the office during bourgeois working hours.

The key point I want to make, though, is that even with its gaps, this data is better than a rough sense of how I might feel. Of course, I don’t want to generalize from my experience to that of any other academic. So, on what basis do you say you are busier now than you were five years ago?

Nerd section:

How did I get the data? I’ve archived all my email; the post-2004 email was all on Apple Mail. I used an AppleScript script to scrape all those almost 25,000 messages into a CSV file, with time stamps, and then I played with R to analyze it. Thanks to Cait Holman and David Cottrell for help with the technical stuff.
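The analysis step, turning a CSV of time stamps into monthly totals and an hour-of-day distribution, is simple enough to sketch. I used R; here is a minimal Python equivalent. The column name `date` and the ISO time-stamp format are assumptions about the export, not what my AppleScript actually produced:

```python
import csv
import io
from collections import Counter
from datetime import datetime

def tally_sent_mail(csv_text):
    """Count sent messages per month (for the bar chart) and per hour
    of day (for the dot plot). Assumes a 'date' column holding
    ISO-8601 time stamps; a real export may need format tweaks."""
    per_month = Counter()
    per_hour = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = datetime.fromisoformat(row["date"])
        per_month[ts.strftime("%Y-%m")] += 1
        per_hour[ts.hour] += 1
    return per_month, per_hour

# Tiny invented sample in the assumed format.
sample = """date,subject
2011-02-14T09:15:00,Re: admissions
2011-02-28T16:40:00,Re: committee
2011-03-01T11:05:00,Syllabus question
"""
months, hours = tally_sent_mail(sample)
print(months["2011-02"])  # 2
```

With the real file, the `per_month` counter gives exactly the monthly totals plotted above, gaps and all, and `per_hour` is the marginal distribution of the time-of-day chart.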

The Missing Chili Pepper, Part 1

In a conversation about RateMyProfessors.com, a colleague once yelled (after a few drinks), “I don’t care about the rating. All I want is the chili pepper!” Well, it turns out that if you get the chili pepper, you are more likely to have a good rating, too.

RateMyProfessors.com, or RMP, is a crowdsourced college instructor rating system that our colleagues generally hate, ignore, or know nothing about. There are reasons to be critical of it: It’s anonymous (so people can vent and be mean — or not even be students in your course). Its response rates are spotty and generally bimodal (most instructors have no rating, and mainly it’s only those who like you or hate you who bother to rate you). And some of the variables RMP cares about are not conducive to good learning: you can be rated on “easiness,” and, most problematically, students can decide whether you are “hot” by giving you a chili pepper. Why should anyone care about that? Why should they be paying attention to the way you look, for cryin’ out loud?

OK, so it’s not great. But there are things to say in favor of RMP, too. It doesn’t only track problematic things such as easiness or the professor’s perceived hotness. It asks about the professor’s clarity and helpfulness, too. In fact, a professor’s scores on those two items constitute his or her overall “quality” score: professor quality is simply the average of the clarity and helpfulness scores.
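The two pieces of RMP arithmetic in this post are small enough to write down: quality is the mean of clarity and helpfulness, and the chili pepper appears when at least a third of raters vote “hot.” The function names here are mine, a sketch of the stated rules rather than RMP’s actual code:

```python
def quality_score(clarity, helpfulness):
    # RMP "overall quality" is simply the mean of the two component scores.
    return (clarity + helpfulness) / 2

def has_chili_pepper(hot_votes, total_raters):
    # The chili pepper shows when at least a third of raters vote "hot".
    return hot_votes / total_raters >= 1 / 3

print(quality_score(4.5, 3.5))   # 4.0
print(has_chili_pepper(5, 12))   # True: 5/12 clears the one-third bar
print(has_chili_pepper(2, 12))   # False
```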

Also, for better or for worse, in many places, such as at the University of Michigan, RMP is the only thing students have to evaluate instructor quality. At Michigan, we do not make our own, official teaching evaluation results available to the students. (We are hoping to change that. Stay tuned.) And despite the things some faculty say about students, they are not stupid. Talking to students about their use of RMP tells me that students are pretty good at understanding the ways in which the tool is imperfect.

But let’s return to that chili pepper. I am heading a project that looks at how we evaluate teaching at the University of Michigan. I have the luxury of working with a talented grad student, David Cottrell, who is doing wonderful things with data analysis. Among other things, he scraped all UM instructors’ ratings from RMP (all 31,000 of them, for more than 3,000 instructors). Below is an interesting chart:


In case you don’t want to look at it carefully, let me summarize the details. On the x-axis is an instructor’s easiness score; on the y-axis the “Professor quality.” The red line represents those professors rated “hot” (which means that at least one third of their raters gave them the chili pepper). The blue line represents those instructors who didn’t receive a chili pepper.

Some observations:

  • There is some correlation between an instructor’s perceived easiness and his or her overall quality, but it is not strict. In other words, quality doesn’t just track easiness; RMP isn’t just a tracker for an easy A.
  • There is a stronger correlation between easiness and quality for instructors who don’t have the chili pepper.
  • So, most disturbingly, if you are not seen as hot, you have to be almost as “easy” as the “hardest” “hot” professor to get the same quality rating as that hardass!

In other words: there is a very significant rating penalty for instructors who do not receive a chili pepper.

A bunch of interesting — and troubling — issues arise. What does the chili pepper actually track? Whereas the other RMP measures are scales, the chili pepper is just a yes-no variable. What leads a student to give an instructor a chili pepper? Let’s assume, first, it is all about “hotness,” that is, some kind of sexual/sexualized desirability. Does that mean that only those students for whom the instructor is in the possible realm of sexual objects are even considering it: women for hetero men, men for hetero women, women for lesbian women, men for gay men, and so on? (My hunch is no — we aren’t all metrosexuals, but lots of people are able to talk about attractiveness beyond personal preferences.)

But I have a hunch that the chili pepper tracks something beyond a purely creepy sexual attraction. In fact, I think it might be another measure of the student liking the instructor. It’s not perfectly correlated, but as the chart shows, there is a correlation. It’s still very disturbing — and interesting — if students sexualize or objectify their appreciation for an instructor, at least when invited to do so in such terms.

Please do not suggest that the easy solution to these questions is for me and David to go through all those 3,000 instructors’ websites and see if they are actually hot. Whatever that might mean. But do suggest ways of thinking about the data. We are interested, really.

And in case you wonder why this post is called part 1: we will be able to see whether the chili pepper effect gets replicated in the evaluation data that the University of Michigan collects — and which certainly asks no questions about the instructor’s hotness.