Polimath

Confidence Intervals for Medians and Percentiles

Stephen — Tue, 16 Feb 2016 15:06:25 +0000

Medians are better than means in most interpretation contexts: they’re much less affected by skew, outliers, or other departures from normality. They give a better sense of the “typical” data point. When the mean and median differ, I prefer to use the median.

One problem with using medians is that you can’t calculate a confidence interval for them the same way as you calculate one for a mean. There’s no simple “standard error of the median” formula like the one for the mean. However, it turns out there is a way to calculate confidence intervals for them.

Let’s be clear about the context. When we calculate a confidence interval for a mean, we’re saying that our data is a sample from some population and that the interval is a plausible range for the population mean. Similarly, when we calculate a confidence interval for a median, we’re saying our data is a sample and the interval is a plausible range for the population median. When there’s a ton of data, a point estimate tells a reasonable story about the population, but when there’s less data, knowing how accurate your estimate is can be important.

I found a nifty bit of math explaining how to calculate them here. I wrote a little R code to implement it here. You can see it here with source for the Shiny app here. Note that you’ll need a test file; here is a small one.

Fun fact: The same algorithm allows you to generate a confidence interval for percentiles other than the 50th (the median). The code I wrote lets you set the percentile and the desired confidence level.
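For readers who want to see the idea in code, here is a minimal R sketch of one standard distribution-free approach, which is not necessarily the algorithm linked above. It relies on the fact that the number of observations falling below the true percentile is binomial, so binomial quantiles pick out which order statistics to use as the interval endpoints. The function name `quantile_ci`, its arguments, and the simulated data are hypothetical.

```r
# Distribution-free confidence interval for the p-th quantile (p = 0.5 is the median).
# Assumption: observations are independent draws from a continuous distribution.
quantile_ci <- function(x, p = 0.5, conf = 0.95) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  alpha <- 1 - conf
  # The count of observations below the true quantile is Binomial(n, p),
  # so binomial quantiles give the ranks of the interval endpoints.
  lower_rank <- qbinom(alpha / 2, n, p)
  upper_rank <- qbinom(1 - alpha / 2, n, p) + 1
  if (lower_rank < 1 || upper_rank > n) {
    stop("Not enough data for this percentile and confidence level.")
  }
  # Because the binomial is discrete, the achieved coverage is at least `conf`.
  coverage <- pbinom(upper_rank - 1, n, p) - pbinom(lower_rank - 1, n, p)
  list(lower = x[lower_rank], upper = x[upper_rank], coverage = coverage)
}

# Example with skewed, simulated data (hypothetical listening times in minutes)
set.seed(42)
listening_times <- rexp(200, rate = 1 / 40)
quantile_ci(listening_times)           # interval for the median
quantile_ci(listening_times, p = 0.9)  # interval for the 90th percentile
```

The interval endpoints are always actual data values, and the function stops rather than returning an interval that cannot reach the requested confidence level.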

If you find this useful, let me know!

Defining Cause

Stephen — Fri, 12 Feb 2016 18:39:27 +0000

It rained today and I didn’t have an umbrella, so I got wet. Why did I get wet? What caused me to get wet?

Suppose I was in LA during a drought and it was a weird, one-off shower. You’d say that I got wet because it rained.

Suppose I was in Seattle during an especially rainy season and, uncharacteristically, I forgot my umbrella. You’d say that I got wet because I didn’t have an umbrella.

In both cases, the cause is a necessary condition. In LA and Seattle both, it is necessary (1) for it to rain and (2) for me to not have an umbrella. So a cause seems to be a necessary condition.

There are other necessary conditions in our story. For example, the sun must have risen. Without the sun, the Earth would freeze (or be blown away, it depends on why the sun was gone) so I certainly wouldn’t be getting wet. We don’t say (with seriousness) that the continued existence of the sun caused me to get wet. It’s necessary, but it’s too likely to be considered a cause.

The cause is the least likely necessary condition.

In LA, it was unlikely to rain and very likely (in a drought) that I wouldn’t have an umbrella, so the rain is the cause. In Seattle, it was very likely to rain but very unlikely I’d forget my umbrella, so the lack of umbrella is the cause.

There are times when we won’t agree on identifying a cause: (1) when we consider different sets of necessary conditions, and (2) when we don’t agree on the relative likelihood of two necessary conditions. Let’s consider some examples.

Disagreeing About Cause, Case 1
Suppose I told you that the Vogons were going to make an interstellar highway and that they were supposed to have destroyed the Earth today. At the last second someone got the date wrong and they waited until tomorrow. Now it’s very unlikely that the sun would have risen (no Earth, no sunrise) but it did, just by luck. You could credibly argue that that was the least likely necessary condition because the Vogons are very detail-oriented and are unlikely to get the date wrong. If we don’t know about the Vogons, we won’t attribute my sogginess to an alien clerical error but instead to the rain or my lack of umbrella, depending on the city. We can’t identify all of the necessary conditions, so which is least likely depends on the set of conditions we consider.

Disagreeing About Cause, Case 2
In our LA example we attribute my dampness to the weather because it was less likely than me (quite normally) not bringing an umbrella. However, my wife knows that she saw the weather forecast predicting rain and that she had told me over breakfast that I should take my umbrella. Typically I heed my much smarter spouse, so it was exceedingly unlikely that I’d forget my umbrella. And yet I did. Knowing that, my wife blames me for getting wet. You, not knowing that, blame the weather. (Someone else might ask about what distracted me from taking my umbrella, and blame that.) You and my wife have different subjective beliefs about the likelihood I’d forget my umbrella, so you disagree about the cause. Subjective probability is a central idea here. A little more context can always shift the blame.

This means that cause itself is not objective but rather is subjective, dependent on the conditions identified by the observer and the observer’s beliefs about relative likelihood.

If we can’t be sure of agreement, why bother? One reason is that when we do agree on the relative likelihood of necessary conditions, we can come to agreement about causes. Additionally, even without agreeing on all the necessary conditions to consider it’s possible to agree that something in particular is not the cause.

Suppose your loved one just died in a car accident right after you argued with them. Would you blame yourself? Probably. But suppose your loved one lost control when the brakes failed and they, distracted by the argument, didn’t regain control quickly enough to avoid the crash. Brakes fail like that very infrequently. Unfortunately, you probably argue with your loved one more often than brakes fail like that. Even if you get along very well, they’ve probably driven after arguing with you and they weren’t in an accident before. The brake failure is less likely, so you can console yourself that you didn’t cause the accident.

This might not relieve you completely. You might consider whether you contributed. There are two things to consider. First, was your argument a necessary condition? Would they have crashed without the distraction? If so, relax. But what if you think they wouldn’t have crashed without the distraction? Then secondly you might consider how much you contributed…but let’s leave that for another time.

Here are some questions the reader might consider:

What does partial culpability (cause) mean using this definition of cause?
Does it make a difference whether the potential causes are human? Sometimes we’re looking for who is responsible.
How can we agree “enough” about the set of necessary conditions being considered to conclude something specific is the cause?

Rating podcast listening experiences using Time Scaled Values

Stephen — Wed, 20 Jan 2016 20:45:10 +0000

Suppose we want to recommend podcast episodes to users. Instead of having users rate each episode, we want to infer from their listening/skipping behavior how much they liked each episode we offer to them. What we want is some kind of rating value we can infer from their behavior that captures how positive an experience they had with our app. In turn this helps us offer more of the kinds of things they enjoy.

How much is it worth for a user to listen to an episode? Clearly listening to more of an episode is better than listening to less of an episode (Assumption 1). Almost as clear is the idea that listening to all of a longer episode shows more engagement than listening to all of a shorter episode (Assumption 2).

Suppose we have two podcast episodes: A is 5 minutes long and B is 30 minutes long.

Alice listens to all of both A and B. We could give A a value of 5 and B a value of 30, the listening time in minutes (supporting Assumption 2). However, basing recommendations on these values will reward longer episodes too heavily over shorter ones: B is valued at six times A. Alternatively, we could give both A and B ratings of 1 (in line with Assumption 1), indicating that Alice listened to 100% of both episodes. This violates Assumption 2, under which B should get a higher rating than A.

Bob listens to all five minutes of A and only five minutes of B. We could give both of them ratings of 5, the listening time. However, finishing A shows more satisfaction than skipping out of B 17% of the way into it, which goes against the spirit of Assumption 1. We could give A a value of 1 and B a value of 5/30 = .17. This follows Assumption 1 but emphasizes episode A too much by giving it six times the rating of B when they both captured the same amount of the user’s time.

What we need is a rating that is somewhere between listening time and percent completed.

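Percent time completed is calculated as

$$\text{percent completed} = \frac{\text{listening time}}{\text{duration}^{1}}$$

We can “calculate” listening time as

$$\text{listening time} = \frac{\text{listening time}}{\text{duration}^{0}}$$

To get something in between, we need to find the exponent e on duration so that 0 < e < 1. Call the result the Time Scaled Value (TSV):

$$\text{TSV} = \frac{\text{listening time}}{\text{duration}^{e}}$$

Here’s the trick: we need to choose two pairs of (listening time, duration) that we want to have the same value. This gives us an equation

$$\frac{\text{listening time}_1}{\text{duration}_1^{\,e}} = \frac{\text{listening time}_2}{\text{duration}_2^{\,e}}$$

with solution

$$e = \frac{\ln(\text{listening time}_1 / \text{listening time}_2)}{\ln(\text{duration}_1 / \text{duration}_2)}$$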

Suppose you want the same Time Scaled Value for listening to 90% of a two minute story and half of a thirty minute podcast. Listening time 1 is 1.8 minutes (of a 2 minute duration) and listening time 2 is 15 minutes (of a 30 minute duration). Solving, this gives us e = .783. Using this, consider the TSV for listening 1.8 minutes to episodes of varying length. For a 5 minute episode we get .511; for a 30 minute episode we get .126.

This makes sense: as the duration increases, the fraction of the episode heard decreases, so the value decreases. Yet, it’s not as fast a decrease as if we just used the percent time completed. You can see this by looking at, say, the ratio of the TSVs for 5 minutes and 30 minutes: .511 / .126 = 4.07. If we used listening time, the ratio would be 1; if we used percent time completed the ratio would be 6.
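The arithmetic above can be checked with a short R sketch, assuming the Time Scaled Value is listening time divided by duration raised to the exponent e (so e = 0 gives listening time and e = 1 gives percent completed). The function names `tsv` and `solve_exponent` and the list of durations are hypothetical, chosen only for illustration.

```r
# Time Scaled Value: listening time divided by duration raised to an exponent e.
# e = 0 gives raw listening time; e = 1 gives the fraction completed.
tsv <- function(listening_time, duration, e) {
  listening_time / duration^e
}

# Exponent that gives two (listening time, duration) pairs the same TSV.
solve_exponent <- function(time1, duration1, time2, duration2) {
  log(time1 / time2) / log(duration1 / duration2)
}

# 90% of a 2 minute story should be worth the same as half of a 30 minute podcast
e <- solve_exponent(1.8, 2, 15, 30)
round(e, 3)

# TSV for 1.8 minutes of listening across episode lengths (in minutes)
durations <- c(2, 5, 10, 30, 60)
data.frame(duration = durations, tsv = round(tsv(1.8, durations, e), 3))

# Ratio of the TSVs for the 5 minute and 30 minute episodes
tsv(1.8, 5, e) / tsv(1.8, 30, e)
```

Any two anchor pairs can be swapped in to fit a different product's sense of what counts as "equally good" listening.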

Does this make sense? Listening to 1.8 minutes of a 30 minute episode shows that you heard the intro and the very beginning of the story, but passed on the rest. It shows interest, but not much commitment. Listening to 1.8 minutes of a two minute story means you heard most of the story, but 1.8 minutes isn’t very long; you might have just toughed it out rather than reach for your phone to skip the story.

Empirically, we’ve found that an exponent of .912 works well for the app we’re developing. The ranking of TSVs mirrors other measures of success like the rate at which stories are shared on social media.

If you’re looking for something that gives more weight to longer content but not so much that it swamps the percent of the content, Time Scaled Values are worth a look.

How often can Thomas Bayes check the results of his A/B test?

Stephen — Wed, 18 Feb 2015 22:23:13 +0000

Stopping your A/B test once you reach significance is a great way to find bogus results…if you’re a frequentist. Checking before you have the statistical power to detect the phenomenon will often lead to false positives if you rely on classical/frequentist methods. A Bayesian with an informative null-result prior can avoid these problems. Let’s think about why.

Suppose we are testing a difference of means* on the time a user spends on a site. To the frequentist, the difference of means is entirely a function of the data. This means that as we run the test and check as we go, the first time the data (randomly?) suggests significance, we probably don’t really have a significant result. The significance threshold is .05, which means that when there’s no effect a single check is wrong 5% of the time; each additional check is another chance to be wrong, so by checking regularly we can easily turn what should be a null result into what appears to be a significant result. Such is folly.
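A small simulation makes the problem concrete. This is a minimal R sketch under assumed settings (two groups with no true difference, a t-test at each look, a look every 50 observations); the function name `peek_false_positive_rate` and all of its defaults are hypothetical. It compares the false positive rate when stopping at the first "significant" look against the rate from a single look at the end.

```r
# Simulate A/B tests where there is truly no effect and compare
# "stop at the first significant look" with "look once at the end".
peek_false_positive_rate <- function(n_sims = 500, n_max = 1000,
                                     look_every = 50, alpha = 0.05) {
  looks <- seq(look_every, n_max, by = look_every)
  hit_any_look <- logical(n_sims)
  hit_final_look <- logical(n_sims)
  for (i in seq_len(n_sims)) {
    a <- rnorm(n_max)  # group A, no true effect
    b <- rnorm(n_max)  # group B, same distribution as A
    # p-value at each interim look, using only the data seen so far
    p <- sapply(looks, function(n) t.test(a[1:n], b[1:n])$p.value)
    hit_any_look[i] <- any(p < alpha)
    hit_final_look[i] <- p[length(p)] < alpha
  }
  c(peeking = mean(hit_any_look), single_look = mean(hit_final_look))
}

set.seed(2015)
peek_false_positive_rate()
```

With many looks, the peeking rate generally comes out well above the nominal 5%, while the single look stays close to it.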

Now, consider the Bayesian using an informative prior that the difference of means is zero. The result is a balance between the data and the prior. If there is no effect, the result is a balance between the prior suggesting the null result and the data randomly fluctuating. By balancing it with the null prior, the result of the test fluctuates less and is not likely to give a false positive result.

Note that the prior has to be informative. If we use a flat prior, there is nothing to balance out the fluctuations in the data and we just get the (false positive) frequentist result. The prior has to be informative to protect us from false positives.

For my A/B tests of the difference of means* I like to use a normal prior for the difference having mean zero and having a standard deviation so that there is only a 5% chance of it having an effect larger than the minimum meaningful effect. R code to calculate the standard deviation might look like

  # Choose the prior SD so that only 5% of the prior mass lies beyond +/- MinimumMeaningfulEffect
  prior.sd <- -MinimumMeaningfulEffect / qnorm(.05 / 2)

What’s the downside? Your results are the weighted average of the data and the prior, so it will take more data to get a positive result. If you have very little data, this is not the approach for you: design your study well using power calculations and be patient. However, if you have a lot of data, this can be just the ticket.
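To illustrate the weighted average, here is a minimal R sketch of the textbook normal-normal update, assuming the observed difference of means is approximately normal with a known standard error. The function name `posterior_diff`, the 30 second minimum meaningful effect, and the example numbers are all hypothetical.

```r
# Normal prior centered on "no effect", combined with a normal likelihood
# for the observed difference of means (conjugate normal-normal update).
posterior_diff <- function(obs_diff, obs_se, prior_sd, prior_mean = 0) {
  prior_precision <- 1 / prior_sd^2
  data_precision <- 1 / obs_se^2
  post_var <- 1 / (prior_precision + data_precision)
  # Posterior mean is the precision-weighted average of prior mean and data
  post_mean <- post_var * (prior_precision * prior_mean + data_precision * obs_diff)
  list(mean = post_mean, sd = sqrt(post_var))
}

MinimumMeaningfulEffect <- 30  # hypothetical: 30 seconds
prior.sd <- -MinimumMeaningfulEffect / qnorm(.05 / 2)

# Early look: noisy estimate, pulled strongly toward zero
early <- posterior_diff(obs_diff = 40, obs_se = 20, prior_sd = prior.sd)
# Later look: same estimate from more data, so the prior matters less
late <- posterior_diff(obs_diff = 40, obs_se = 5, prior_sd = prior.sd)

# 95% credible intervals for the difference
early$mean + c(-1, 1) * qnorm(0.975) * early$sd
late$mean + c(-1, 1) * qnorm(0.975) * late$sd
```

The early estimate is shrunk heavily toward zero, which is what protects against a lucky early fluctuation; as the standard error falls, the data dominate.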

Trust your A/B tests and check them as you go — but do it right.

* Or medians — they’re just as easy to model in a Bayesian context.

When Enough is Enough with your A/B Test

Stephen — Thu, 15 May 2014 12:44:04 +0000

A good A/B test tool should be able to reach the following conclusions:

A beat B or B beat A, so you can stop.
Neither A nor B beat the other, so you can stop.
We can’t conclude #1 or #2 but you’ll need about m more data points to conclude one of them.

The tools I’ve found for analyzing A/B tests can all answer #1. Some of the better ones can answer #3. None of the tools I’ve seen will answer #2 and tell you that A and B are not meaningfully different and that you have enough data to be pretty sure about that.

This has to do with the way most people use hypothesis testing. Stats students are taught to test the simple hypothesis “Is the amount B improves over A positive?” They get a p-value (related to the notion of the effect actually being negative) and go from there. The problem is that the probability that the effect is precisely zero is zero.

Here’s a fix for that: Choose a minimum meaningful effect and test the hypothesis that the absolute value of the effect is smaller than that value.

Here’s an example. Recently I was testing the hypothesis that version B of an app improved daily listening times over version A of the same app. Daily listening times for this app are around 40 minutes and the cost difference for implementing B over A is small, so our product owner and I decided that any change less than 30 seconds was not meaningful. This left me with three hypotheses to test:

Listening times for B are more than 30 seconds longer per day than for A (B wins).
Listening times for A are more than 30 seconds longer per day than for B (A wins).
Listening times for B and A are within 30 seconds of each other (tie).

I can test all of these hypotheses with standard hypothesis tests. If none are true, I can assume the mean difference in times is correct (it is our best estimate of the mean, given the data) and do a power calculation (although this is not the standard calculation, it’s pretty straightforward) to tell how much more data I need.

All three questions answered.
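The three tests can be sketched with base R's `t.test`, shifting the null value to the minimum meaningful effect. This is a minimal illustration under my own assumptions (Welch t-tests, a tie declared by two one-sided tests), not the code behind the tool described here; the function name `ab_verdict` and the simulated data are hypothetical.

```r
# Classify an A/B test as "B wins", "A wins", "tie", or "need more data".
# a, b: numeric vectors of outcomes; min_effect: smallest difference that matters.
ab_verdict <- function(a, b, min_effect, alpha = 0.05) {
  # B wins: mean(b) - mean(a) is greater than +min_effect
  p_b_wins <- t.test(b, a, mu = min_effect, alternative = "greater")$p.value
  # A wins: mean(b) - mean(a) is less than -min_effect
  p_a_wins <- t.test(b, a, mu = -min_effect, alternative = "less")$p.value
  # Tie: the difference is both above -min_effect and below +min_effect
  p_tie <- max(
    t.test(b, a, mu = -min_effect, alternative = "greater")$p.value,
    t.test(b, a, mu = min_effect, alternative = "less")$p.value
  )
  if (p_b_wins < alpha) return("B wins")
  if (p_a_wins < alpha) return("A wins")
  if (p_tie < alpha) return("tie")
  "need more data"
}

# Simulated daily listening times in seconds (about 40 minutes on average)
set.seed(2014)
a <- rnorm(5000, mean = 2400, sd = 300)
b <- rnorm(5000, mean = 2400, sd = 300)
ab_verdict(a, b, min_effect = 30)
```

When the verdict is "need more data", the observed difference and standard deviation can feed a power calculation to estimate how many more observations are required.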

I implemented this in R using the ‘shiny’ package to make it an interactive Web-based tool. A live demo is here and sample code is here on GitHub. You’ll need a server with shiny-server installed to use and test it or you can run it on ShinyApps.io (like my demo). I found it trivial to install on my Ubuntu server which I run at work for internal use.

Make GitHub R Code Available within R

Stephen — Sat, 08 Sep 2012 01:07:49 +0000

After gradually migrating most of my workflow from Subversion to GitHub I discovered an itty, bitty, tiny, huge freakin’ problem. Part of my old workflow involved me storing code I would use again and again in a public repository then source-ing the code directly into R as needed. It also made this code easy for me to share with others, especially students and collaborators. No problem.

GitHub is superior to Subversion in notable ways, but that’s not our topic here. GitHub does make it easy to read source code directly from the site as plain text. Here’s an example of an address for a bit of code I use almost daily to give me a clean R session.

  https://raw.github.com/shaptonstahl/R/master/Decruft/Decruft.R

Anyone see the problem? Two things: (1) The URL for the plain text version of code hosted on GitHub is reached via a secure connection, and (2) R can’t source via https without the use of an external library. I’m a big fan of R’s external libraries, but it doesn’t fit the purpose of the code. This code usually sits at the top of just about every .R file I write:

  ##  Start fresh!
  source("http://address.of/Decruft.R")

Isn’t that pretty? Short, sweet, easy to remember. This is how I used to do business. Unfortunately, this is the most concise way that I could find for doing it with GitHub:

  ### Fugly code
  # Install devtools if it is missing, then use its source_url() to read over https
  if( !any("devtools" == installed.packages()[,"Package"]) ) install.packages("devtools")
  library(devtools)
  source_url("https://github.com/crikey/thats/as/long/as/the/code/Im/sourcing.R")

I checked the Google and such but nobody seemed to be asking precisely what I was: how can I read code stored on GitHub in plain text using http, not https? We will not be discussing how long it took me to come up with a satisfactory solution. Let’s just say it took long enough that I really don’t want anyone else to have to go through it.

Here’s my solution:

Have a Web server running PHP that allows you to create and use .htaccess files.
Choose a URL for the stem of where your code will appear to be.
Use .htaccess to point 404 Not Found errors to a custom error page.
The PHP-based error page uses https to get the live file from GitHub and feeds it to the person requesting it.

It could be worse. I decided that I would request pages from (nonexistent) subfolders of http://www.haptonstahl.org/R/ in order to read code stored under https://raw.github.com/shaptonstahl/R/. So I put two little files in the document root of my site. The first is named .htaccess and contains this:

  ErrorDocument 404 /404.php

The other file is the 404.php file mentioned in .htaccess. You can download the PHP file here. This is a copy of the actual one I am using. Now to get a clean R session I just type the following:

  source("http://www.haptonstahl.org/R/Decruft/Decruft.R")

Easy peasy. Perhaps the best part is that I’m done. I never have to update or modify this if I want to source other public code in that GitHub repository. For example, without changes I can source

  http://www.haptonstahl.org/R/RoundBoundsNicely/RoundBoundsNicely.R

to get

  https://raw.github.com/shaptonstahl/R/master/RoundBoundsNicely/RoundBoundsNicely.R
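The URL rewriting that the error page has to perform is simple enough to sketch. The snippet below is an R illustration of the mapping only, not the PHP file itself; the function name `to_github_url` and the two stem variables are hypothetical.

```r
# Illustrative only: map a requested http URL under the proxy stem
# to the corresponding raw file on GitHub.
proxy_stem <- "http://www.haptonstahl.org/R/"
github_stem <- "https://raw.github.com/shaptonstahl/R/master/"

to_github_url <- function(requested_url) {
  if (!startsWith(requested_url, proxy_stem)) {
    stop("URL is not under the proxy stem")
  }
  # Keep everything after the stem and append it to the GitHub stem
  paste0(github_stem, substring(requested_url, nchar(proxy_stem) + 1))
}

to_github_url("http://www.haptonstahl.org/R/RoundBoundsNicely/RoundBoundsNicely.R")
```

Because the mapping depends only on the path after the stem, new files added to the repository need no changes on the Web server.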

Lesson to take home:

Putting code you reuse up on teh webz makes it easy for you to use it over and over instead of writing it over and over.
GitHub rocks, now even more since I can source my R code from the live GitHub versions.
Safety first, kids!

Is Algebra Necessary? Yes and No.

Stephen — Wed, 01 Aug 2012 04:17:29 +0000

Political scientist Andrew Hacker recently asked “Is Algebra Necessary?” and the response has, unfortunately, been predictable.

Those in society’s minority who did well in math courses are “shocked” at the suggestion that we change the typical math curriculum. The teaching may be “dismal” but algebra is a “foundation stone” in developing critical thinking skills. “It teaches one how to think.” It’s a little amusing but mostly disheartening to see folks who claim to support more challenging math standards fall back on strawman arguments, condescension, sarcasm and, my favorite, math errors in their arguments.

Those in society’s majority who did poorly in math tended to respond with relief at the suggestion of dropping algebra, although there are a few PMSD (post-mathematics stress disorder) victims whose career paths were altered by failing math and who still carry the associated baggage and resentment.

Let’s set aside the hysterics (“We are breeding a nation of morons”) and give both sides of this debate a fair shake, shall we?

Arguments in Favor of Eliminating Algebra as a Course Required for All

We definitely teach too much algebra and do so mindlessly, without considering whether it’s useful. As a teacher of math courses from arithmetic through calculus at a community college, I fought the losing fight to remove useless topics from the curriculum. For example, Cramer’s rule is a relic from the days before computers and is as practical as a slide rule, but trying to remove it from the required topic list elicited resistance and deep resentment from many of my fellow faculty. Hacker’s suggestion that we reconsider the requirement for so much symbolic manipulation is sensible.

While teaching algebra I tried to limit my syllabi to (1) the topics that would be used in later courses, (2) topics that might be useful outside the classroom, and (3) some examples of true beauty. I emphasized (1) but snuck in some of (2) and (3). Still, I did teach some material that would only be useful in later math courses and never in any kind of applied setting. We could easily cut more topics by curtailing the length of the required math sequence, at least for the math subjects taught.

Algebra is not the only way to teach disciplined thinking. One can teach precisely the same thinking skills while removing the abstraction that makes math seem useless and difficult to many students. One idea (not mine): we could integrate math beyond middle school into the science curriculum and use applications as motivation. Then there’s no need to learn how to apply math to story problems; the stories are the original problems. This also prevents folks from running around with “hammers” looking for “nails”. Unfortunately I am not sure how to get there from here; teachers of other subjects would have to cover the needed material and that would require revamping the way teachers are taught.

We don’t need to learn algebra to develop our intuition about rates of change, interest rates, probability, statistics, and other topics that typically follow algebra. There are ways (videos, interactive widgets, simulation using simple programming) to develop the intuition of calculus — often the only calculus needed in a job like medicine — without approaching it the rigorous, analytic, symbol-pushing way we typically do. Even for those who eventually will need algebra, we can teach more advanced symbol manipulation skills later as needed.

Nobody (well, almost nobody) is saying that learning algebra has no value whatsoever. However, as long as we have limited resources the pertinent question is, does algebra give us more benefit than spending that time elsewhere? I suggest that programming, statistics, and finance are better uses of most students’ time. Programming is how the nearly countless computers in our lives work; a basic understanding of how they do their magic would be invaluable. Statistics are essential for making sense out of the sea of information around us. Finance is challenging and vital for artists and engineers alike as long as they want to buy a home or save for retirement. If we remove the symbol-pushing exercises of algebra and replace that class time with simple programming, statistics, and finance, we’ll gain more than we’ll lose.

Arguments in Favor of Keeping Algebra as a Course Required for All (with occasional rejoinders)

Let’s be more specific about “algebra”. A first course (“Algebra I”) often includes basic linear algebra (lines, graphing them, solving systems of linear equations, and matrices) plus evaluating polynomial functions, graphing quadratic functions, and solving single quadratic equations. A second course (“Algebra II”) builds on this with lots of factoring polynomials, exponential and logarithmic functions, quadratic inequalities and other algebraic prep for calculus.

An Algebra I course like this is incredibly useful. The concepts generalize to every science, from physics to political science. With this foundation a student can learn the intuition behind calculus, statistics, and other tools that are useful to have seen but usually not useful to retain. I strongly recommend keeping Algebra I part of the core curriculum. With moderate resources, this material can be covered in middle school for most if not all students.

Algebra II is where common responses to “Why am I learning this?” jump the tracks.

“It builds your brain like exercising builds muscle.” I used this one regularly; it’s true, but only a half truth. Programming, statistics and finance can do the same with the added bonus of being unquestionably practical.
“It helps students understand where more advanced math comes from.” Playing with simulations is even better for most students in understanding why more advanced math techniques work the way they do.
“It teaches structured thinking.” Programming is even better for that and it is easier for students to see why structure matters. Mess up the structure and programs do odd things.

Algebra II is full of topics that you don’t need in order to understand the intuition of common useful advanced ideas; you need Algebra II when you will try to master more advanced ideas. I strongly recommend making Algebra II something that fewer students take.

Known Unknown

My recommendation to remove Algebra II from the universal curriculum is contingent on at least one assumption: Students who need more algebra will have sufficient time to learn it later. To see if this is true we would need to take average to bright students interested in technical fields and wait until college to teach them algebra. This is not common. Interestingly, I have some experience that is close to this: teaching math to political science graduate students. While there are a few notable exceptions, most political science undergraduates take very little math. Graduate students work very hard to learn the math necessary for advanced statistics and game theory, and generally they succeed. This evidence is circumstantial but shifts the burden of proof onto those who might argue college is too late to learn algebra.

Bottom Line: Question Mathematical Authority

Hacker thoughtfully asked a good question: are we teaching what we should be teaching? One cannot decry the educational establishment as ossified but resist any attempt to change for the better. Hacker may be throwing out the baby. My ideas might not be the best. What are your ideas? Let’s discuss it.

* This blogger is an award-winning teacher of math, statistics, and programming at the high-school through graduate levels, holds graduate degrees in math and political science, and works in the defense industry as a data scientist.

Review of “Building Data Science Teams” by DJ Patil

Stephen — Wed, 11 Apr 2012 01:34:21 +0000

Having recently committed myself to earning my living as a Data Scientist, I’ve been reading anything I can find to guide my self-education. So I just spent the last hour reading and mulling over DJ Patil’s article/report Building Data Science Teams (BDST henceforth) which is available free from various outlets; I read the Kindle version. (Disclaimer: DJ is a friend and occasional drinking buddy.)

Patil takes up the necessary and generally thankless task of writing a “big-think piece”. It’s necessary because, with all the recent talk about Data Scientists, it would be easy for some to see “Data Science” as a recent entry in the long history of business fads. Business fads tend to indulge in one of two sins: oversimplifying, or being so general that anything counts. Both are sins to which Data Science advocates are susceptible. An advocate of Data Science might oversimplify by giving a recipe or IT shopping list for doing Data Science; Patil avoids this triteness by emphasizing the wide diversity of the good teams he’s built or worked with. Alternatively, one could sin by making anything count as Data Science. Commenter “Verbose” anticipates this problem with his erudite critique of BDST: “Same technical data analysis, new bullshit name”. Patil avoids this and provides a blueprint for others to do the same.

When discussing the etymology of “Data Scientist” Patil writes that “Research Scientist” would not be as appropriate a term for this profession:

“Research scientist” was a reasonable job title used by companies like Sun, HP, Xerox, Yahoo, and IBM. However, we felt that most research scientists worked on projects that were futuristic and abstract, and the work was done in labs that were isolated from the product development teams. It might take years for lab research to affect key products, if it ever did. Instead, the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. (emphasis added) The term that seemed to fit best was data scientist: those who use both data and science to create something new.

Elsewhere Patil provides a solid definition of Data Scientist, but this paragraph encapsulates the concept just as well: A Data Scientist uses data and science to have an immediate and massive impact on the business. Just moving data around? Not data science. Have an impact in the vague future? Not data science. Improving entirely on the margins? Not (all of) data science. Holding the Data Scientist’s feet to the fire — asking “How does this immediately and massively impact our business?” — provides accountability and hence focus for the team.

Writing a “big-think piece” is also a thankless task: the breadth of the topic means that it’s easy for critics to find something to criticize as being presented too simply. This ignores the contribution of providing a view of the new discipline from space, showing all of it as a piece, and showing (if briefly) how the disparate parts fit together. I was glad for a look at a map for this road I’m traveling.

Overall BDST is short, shorter than I would have liked. (I’m glad that Patil is sharing more of his experience through other venues.) The advice Patil gives about Data Science, forging teams, and hiring Data Scientists seems both specific and useful; I’ll post again as I have occasion to use his advice.

Recommendation: “Building Data Science Teams” is short, but with enough good ideas as to be required reading for anyone in business intelligence, internal data analysis, or applied computational modeling and prediction.

Planned Serendipity

Stephen — Mon, 05 Sep 2011 20:22:21 +0000

Yesterday I got back from a great APSA in Seattle. My undergraduate students were despondent at me having to cancel class Thursday so I could attend. A few were curious about what happens at a scientific conference and asked about the structure. I explained that there would be several thousand political scientists at this conference and that most of the planned interaction would take place in panels.

In theory, a panel takes place in a room where 3-300 (median = 10) people watch three to five papers get presented by their authors. Then a discussant, who reads the papers in advance, comments on the papers both to draw connections among them and to stimulate conversation among the attendees. Then the audience asks questions and offers feedback to the authors. The whole panel takes about 1 3/4 hours.

Although panels comprise most of the scheduled events at a conference, they are not the best reason for scientists to attend conferences and they are far from the most rewarding part of a conference. Panels are often poorly attended. The papers in a panel often have very little to do with each other. The discussant may not receive the papers until days or moments before the panel, if at all, and even so the comments may focus more on typography than on big ideas.

Panels are a party game. They are an excuse to get smart people, who are interested in similar things, together in a room talking. Put a bunch of clever folks together and strange, wonderful, unpredictable things happen. A conference is mass planned serendipity.

The largest conference benefits to my research have happened when I have not been at panels: between panels, skipping panels, into the evening and the night. It’s the networking, but not “networking” in the Machiavellian, salesperson sense. It’s the comment on my paper that someone was a little too shy to offer in front of everyone, the comment that helps me recast the paper so it will place higher. It’s running into the same person at three panels and finally discovering we would love to work together on some research. It’s the dinner outing that leads to an invited talk or an interview. It’s the shared coffee followed up with a Facebook friending that leads to a new real friendship.

All of this is made possible by panels, but it’s not the direct result of the panels. So when someone tells me they didn’t go to a lot of panels, I understand that they probably got a lot of professional good out of the conference.

Also, I had a lot of fun at the Space Needle.

Bayes fixes small n, doesn’t it?

Stephen — Thu, 03 Mar 2011 17:09:29 +0000

What is a methods-careful practitioner to do when the number of observations ($n$) is small? I don’t know how many times I’ve been told by a well-meaning Bayesian some variation of

Bayesian estimation addresses the “small $n$ problem”

This is right and wrong.

Maximum likelihood estimators (MLEs) leverage large $n$ in two ways:

1. Making inferences about parameters.
2. Checking model fit.

The goal of #1 is to answer questions like “Is $\beta > 0$?” or “What is the shortest interval that has a 95% chance of containing $\beta$?” MLEs let us take a stab at answering* these. A large number of observations means we can apply the Central Limit Theorem, which means we can use the standard errors to test simple hypotheses and build confidence intervals. The goal of #2 is to justify our choice of model after the fact. Presumably we had a good, substantively-informed theory to justify our choice of model, but it’s nice to be able to say afterward “See? The model fits well, so we weren’t crazy to choose this model.”

How many is a “small” number of observations? It’s difficult to say. Long (1997) gives some guidance about this, suggesting that 100 is a minimal number for an MLE, and that one needs at least 10 observations per parameter in the model. However, we never really know if this is enough to be able to trust our inferences and fit.
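As a quick illustration, the rule of thumb above can be written as a one-line check. This is only a sketch: the function name `meets_rule_of_thumb` and its parameter names are made up for this example, and the defaults simply encode the two numbers quoted above (100 observations minimum, 10 per parameter).

```python
def meets_rule_of_thumb(n_obs, n_params, min_obs=100, obs_per_param=10):
    """Return True if the sample size clears both thresholds quoted above.

    Hypothetical helper: the thresholds are the rule-of-thumb numbers
    from the text, not a guarantee that inferences can be trusted.
    """
    required = max(min_obs, obs_per_param * n_params)
    return n_obs >= required


# A 4-parameter model needs max(100, 40) = 100 observations under this rule.
print(meets_rule_of_thumb(n_obs=80, n_params=4))    # False
print(meets_rule_of_thumb(n_obs=150, n_params=12))  # True: max(100, 120) = 120
```

Passing this check is necessary under the rule of thumb, not sufficient for trusting the results.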

What happens if we have too few observations? Both #1 and #2 become unreliable. We have too little data to assume that the Central Limit Theorem has “kicked in,” so our point estimates have more uncertainty, which means our standard errors are too small**, and therefore any inferences are unreliable. Our tests for model fit are similarly starved for information, so any post-hoc model justification will be difficult.
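To make this concrete, here is a minimal sketch of how badly a CLT-based interval can behave with few observations. The example is my own choice, not something from the argument above: it uses the textbook Wald interval for a binomial proportion (estimate plus or minus 1.96 standard errors) and computes its exact coverage by summing binomial probabilities. The function name `wald_coverage` is hypothetical.

```python
from math import comb, sqrt


def wald_coverage(n, p, z=1.96):
    """Exact probability that the Wald interval p_hat +/- z*se contains p.

    Sums the binomial probability of every outcome k whose interval
    covers the true proportion p. Nominal coverage for z=1.96 is 95%.
    """
    coverage = 0.0
    for k in range(n + 1):
        p_hat = k / n
        se = sqrt(p_hat * (1 - p_hat) / n)  # MLE-based standard error
        if p_hat - z * se <= p <= p_hat + z * se:
            coverage += comb(n, k) * p**k * (1 - p) ** (n - k)
    return coverage


print(wald_coverage(10, 0.1))    # roughly 0.65, far below the nominal 0.95
print(wald_coverage(1000, 0.1))  # much closer to the nominal level
```

With 10 observations and a true proportion of 0.1, the nominal 95% interval covers the truth only about 65% of the time. Much of the shortfall comes from samples with zero successes, where the estimated standard error is exactly zero.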

What does Bayes fix? Bayes estimators are finite-data estimators: more data gives more accurate estimates, but the measures of uncertainty of those estimates are reliable regardless of the amount of data. This means that inferences are reliable regardless of the size of the data set. Want to know if $\beta > 0$ but our $n$ is small? No problem for a Bayesian.
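Here is a small sketch of what a finite-data answer looks like. It is a toy stand-in rather than the regression setting above: I assume a binomial proportion with a flat Beta(1, 1) prior, so the posterior after k successes in n trials is Beta(1 + k, 1 + n - k), and the question “is the parameter above 0.5?” plays the role of “is $\beta > 0$?”. The function names are hypothetical. For integer parameters the Beta CDF equals a binomial tail sum, so no approximation or large-n argument is needed.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x, valid for positive integer a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))


def prob_theta_above(k, n, threshold=0.5):
    """Posterior P(theta > threshold) given k successes in n trials.

    Assumes a flat Beta(1, 1) prior (an assumption of this sketch).
    """
    a, b = 1 + k, 1 + (n - k)
    return 1.0 - beta_cdf_int(threshold, a, b)


# 7 successes in only 10 trials: posterior probability that theta > 0.5
print(prob_theta_above(7, 10))  # about 0.887
```

The answer is an honest statement of uncertainty for n = 10: probably above 0.5, but far from certain. More data would sharpen it; less data would not make the calculation invalid, only vaguer.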

What about model fit? Bayesians have the same tools as frequentists for checking model fit and they also have the numerical and graphical analysis of posterior predictive distributions. As with inferences about parameters, more data is more information and thus is preferable. However, any tests of model fit are still reliable. As a purely practical matter, if we have a very small data set we probably will not be able to conclude anything about model fit from the data. This puts the burden back on the substantive/theoretical argument for the model form. If we understand the data-generating process very well, great; if not, then this part of our argument will need extra scrutiny.
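For readers who haven’t seen a posterior predictive check, here is a minimal sketch under toy assumptions of my own: binary data modeled as independent draws with a single success probability, a flat Beta(1, 1) prior, and “number of switches between 0 and 1” as the test statistic. All function names are hypothetical. The idea is to draw parameters from the posterior, simulate replicate data sets, and ask how often the replicates look at least as extreme as the observed data.

```python
import random


def count_switches(seq):
    """Number of adjacent positions where the value changes."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def posterior_predictive_pvalue(data, n_draws=2000, seed=0):
    """Share of replicate data sets with at least as many switches as the data.

    Assumes an i.i.d. Bernoulli model with a flat Beta(1, 1) prior,
    so the posterior for theta is Beta(1 + k, 1 + n - k).
    """
    rng = random.Random(seed)
    n, k = len(data), sum(data)
    observed = count_switches(data)
    at_least = 0
    for _ in range(n_draws):
        theta = rng.betavariate(1 + k, 1 + n - k)  # posterior draw
        replicate = [1 if rng.random() < theta else 0 for _ in range(n)]
        if count_switches(replicate) >= observed:
            at_least += 1
    return at_least / n_draws


# Perfectly alternating data is very unlikely under an independence model.
alternating = [0, 1] * 5
print(posterior_predictive_pvalue(alternating))
```

A perfectly alternating sequence is an extreme case chosen so that even 10 observations flag the misfit. With less dramatic small data sets, the check will usually be inconclusive, which is the practical point made above.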

Bottom line: Bayesian estimators don’t create more information. However, they do let us correctly identify how sure we are about the inferences we draw. That’s a clear improvement. Other Bayesians aren’t helping by overselling it.

===

* Don’t get saucy with me about how frequentist confidence intervals (CIs) either contain or don’t contain $\beta$, implying that it’s meaningless to talk about the probability of $\beta$ being in the interval. CIs are better than commonly described; details in my next post.

** In theory our standard errors could be too small, too large, or correct. We can appropriately account for this additional uncertainty by increasing the size of our standard errors (think “put a confidence interval around our confidence interval”) except we can’t know how much to increase them. See “reasoning, circular”.