Virtue signalling

What is the opposite of a Like on social media? These days, perhaps the closest thing is an accusation of "virtue signalling." Given the current age, one of recursive outrage hurtling to and fro through online conduits at ever increasing frequency, someone was sure to try to claim higher moral ground through accusations of lazy armchair posturing.

James Bartholomew laid claim to coining the phrase in a self-congratulatory article which seems so smug in tone that perhaps he was trying to head off even a hint of faux modesty that might be interpreted as virtue-signalling (the subhead reads "It’s a true privilege to have coined a phrase – even if people credit it to Libby Purves instead").

To my astonishment and delight, the phrase ‘virtue signalling’ has become part of the English language. I coined the phrase in an article here in The Spectator (18 April) in which I described the way in which many people say or write things to indicate that they are virtuous. Sometimes it is quite subtle. By saying that they hate the Daily Mail or Ukip, they are really telling you that they are admirably non-racist, left-wing or open-minded. One of the crucial aspects of virtue signalling is that it does not require actually doing anything virtuous. It does not involve delivering lunches to elderly neighbours or staying together with a spouse for the sake of the children. It takes no effort or sacrifice at all.
 
Since April, I have watched with pleasure and then incredulity how the phrase has leapt from appearing in a single article into the everyday language of political discourse.
 
...
 
I bumped into Dominic Lawson, former editor of The Spectator, who remarked that my life is now complete: I have added to the English language and can retire from the scene, perfectly satisfied. I have reluctantly given up hopes of ever appearing on Desert Island Discs — a pity considering I have been preparing for it for some 35 years — but at least I can comfort myself that I have coined a phrase. I thus join, admittedly at a low level, the ranks of word-creators such as William Shakespeare (‘uncomfortable’ and ‘assassination’ and many others) and Thomas Carlyle (‘dry as dust’ and, most famously, ‘environment’).
 

Given the culture world war that is 2017, last week the NYTimes published an essay on virtue signalling. The implications of the term seem fairly self-evident, but for those who are new to the phrase, the piece provides a primer.

When people offer their vehement condemnation of some injustice in the news, or change their Facebook profile photos to honor the victims of some new tragedy, or write status updates demanding federal action on climate change, observers like Bartholomew smell something fishy: Do these people really care deeply about the issue du jour? They probably aren’t, after all, out volunteering to solve the problem. What if they’re motivated, above all else, by simply looking like people who care?
 
This sort of ostentatious concern is, according to some diagnoses, endemic to the political left. A writer for the conservative website The Daily Caller wrote this summer that virtue signaling ‘‘has been universalized into a sort of cultural tic’’ on the left, ‘‘as compulsive and unavoidable as Tourette’s syndrome.’’ There are plenty on the left who might agree. It’s not difficult to find, in conversations among progressives, widespread eye-rolling over a certain type of person: the one who will take a heroic stance on almost any issue — furious indignation over the casting of a live-action ‘‘Aladdin’’ film, vehement defense of Hillary Clinton’s fashion choices, extravagant emotional investment in the plight of a group to which the speaker does not belong — in what feels like a transparent bid for the praise, likes and aura of righteousness that follows.
 
The charge of virtue signaling, though, has metastasized well beyond this type of comical figure. Once you’ve decided this ‘‘cultural tic’’ has become universal on the left, almost any public utterance of concern becomes easy to write off as false — as mere performance. It applies when people express dismay that a robotics team made up of Afghan girls may be barred from entering the United States; when someone frets about the American poverty rate; when The Associated Press shares information about a deadly oil-tanker fire in Pakistan. Every one of these things has been described online as the unholy product of ‘‘virtue signaling.’’
 

Of course, accusing another person of virtue signalling is its own form of virtue signalling. When I made reference to claiming moral high ground earlier, I should've been more clear. When applied to online arguments, moral high ground really means people taking turns sliding a sheet of paper under their feet in succession, ending with both sides about an inch off the ground.

My internet was physically disconnected by mistake last week and I spent a week largely offline, and in the few days since it's been turned back on I've returned to the aftermath of the Google memo, the day or two we could afford on the North Korean nuclear weapon debacle, and then headlong into Charlottesville. All serious topics, all deeply troubling, but it's the online discourse around them which has quickly destroyed any accumulated peace of mind from my brief internet vacation.

W.B. Yeats' poem "The Second Coming" is never far from my mind these days, so spot on it is when applied to current online discourse.

Turning and turning in the widening gyre   
The falcon cannot hear the falconer; 
Things fall apart; the centre cannot hold; 
Mere anarchy is loosed upon the world, 
The blood-dimmed tide is loosed, and everywhere   
The ceremony of innocence is drowned; 
The best lack all conviction, while the worst   
Are full of passionate intensity. 
 

Much of social media, but in particular Twitter, should be regarded as Scolding as a Service. Unfortunately SaaS has already been claimed as an acronym, but it's not too late to tout this moat as a unique feature on their next quarterly earnings call. You can go to any old social media service for some sweet, sweet confirmation bias, but if you want to be scolded repeatedly and on demand, no service can beat Twitter.

I'm not going to spend much time rehashing the usual arguments on virtue signalling. By traditional signaling theory, much of online signalling, not just instances of moral indignation, is weak by its very nature.

One of the core tenets of signalling theory is that the best and strongest signals are the costliest ones, the canonical example being the peacock's tail. Human equivalents abound; if you drive a half million dollar Ferrari convertible down a busy thoroughfare, your message gets across more clearly than if you're driving a $70,000 BMW. Since so much that is done online is inherently low cost, online signals are going to suffer from an amplitude problem in general.

[Some claim that the casual dress among Bay Area billionaires is some variant of that theory of costly signals, but I consider it to be the same; the costly signal there is the demonstration of power in disregarding fashion norms. You have such reputational capital that you need not even resort to traditional signals like nice clothes, like some normie.

It is surprising that ways of attaching verified cost or Talebian "skin in the game" to one's online signals haven't been tried. Perhaps an avatar change that can't be made for free but can only be purchased through a donation to some charity, almost like a virtual outfit in some MMORPG. Occasionally someone will match donations to a charity, which is similar, but one of these social networks with an economist on staff is sure to suggest a platform solution at some point.]

This long detour on virtue signalling brings me back to the VC sexual harassment revelations earlier this year. It wasn't that long ago and already it seems like a scandal from another age.

I wrote about the issue from the angle of mutual knowledge becoming common knowledge. In the wake of one woman after another coming forward with their stories of being harassed by various Silicon Valley investors, many in the tech community expressed outrage, and like a moral gag reflex, many of those who expressed outrage were hit with accusations of virtue signalling.

Whether or not you believe those who joined the chorus of outrage when the scandals broke, what they were doing in that context serves an entirely different and important signaling function.

Recall that until the story about Justin Caldbeck broke, many women had held back for years on sharing their own stories, many out of concern they wouldn't be believed, that they might be blackballed by the largely white male investing elites of Silicon Valley. Based on the names of those investors who acknowledged and corroborated the stories of various accusers, the women were right to be concerned.

In fact, many people, myself included, had to update our priors about the incidence of such sexual harassment, and the types of people who might commit such acts. Some who took a fall from grace were highly respected, smart, well-known investors, and the news that yet other stories of harassment might be buried by non-disparagement clauses meant that many had to recalibrate their priors upward even more. The Google memo was a similar issue that had people updating their priors as the volume of visible support both inside and outside the company for Damore took many by surprise.

When a whole lot of people are rapidly updating their priors, signalling where you stand, whether it's virtuous posturing or not, can serve another purpose. It can help people to clarify where you lie on the distribution in question.

Sorry white male investors accusing others of virtue signalling, it may feel silly to have to publicly declare that you're not going to harass the next woman (especially an Asian woman) entrepreneur that you come into contact with, but after hearing so many stories of harassment from such a wide variety of white male investors, many in the community honestly have no idea which of you are prone to such behavior. Clearly, identifying those of you who are wasn't as simple as identifying, say, a white supremacist, who might be Sieg Heiling or waving a Confederate or Nazi flag in public. The sexual harassers didn't have any such villainous mustache or common identifying feature other than being white men. If the signs were clearer, those stories wouldn't have made for such explosive news.

Signaling for one side or the other to help people establish proper priors really matters when it comes to sexual harassment. The more female entrepreneurs believe that the majority of investors are going to give them a fair shake, rather than try to exploit the inherent leverage in the investor-entrepreneur relationship, the more those entrepreneurs will feel safe raising money and calling out bad behavior when it does occur.

In other times in history, having proper priors was a matter of life or death. So for white male investors, to take the example at hand, it could certainly be worse. You could be black, and have to signal that you're not a criminal every day you walk around in public, for fear of being arrested or worse, shot. You could be female and have to signal every day of your life that you're not passive, that you're technically capable of doing your job. For most of history, being white and a man has been the default, meaning that those lucky enough to be in that group have had no socially inherited identity debt to manage or pay down.

More and more, white men, and white people, are being treated as a distinct segment, with their own cultural brand, rather than as the default. Some of this is by choice, some of it is exogenous pressure.

The transition won't be easy. It never is, because it is difficult to notice the absence of something. It's easier to detect if you reverse your surroundings. When I travel to a place like Taiwan, where the majority of people around me share my ethnicity, I feel a bit like Kal-El landing on Earth, a planet with much lower gravity than my home. I feel a weight lifted off of me. The journey many white men are taking now is the reverse.

In the trailer for the next Justice League movie, Barry Allen, the Flash, turns to Bruce Wayne at one point and asks, "What are your superpowers again?"

Ben Affleck, in what will likely be the best line in the entire movie, responds, "I'm rich."

He should have said, "I'm white." It may be suffering a bit of depreciation recently, but it's still just about the most effective signal going.

Que sera sera

Statisticians love to develop multiple ways of testing the same thing. If I want to decide whether two groups of people have significantly different IQs, I can run a t-test or a rank sum test or a bootstrap or a regression. You can argue about which of these is most appropriate, but I basically think that if the effect is really statistically significant and large enough to matter, it should emerge regardless of which test you use, as long as the test is reasonable and your sample isn’t tiny. An effect that appears when you use a parametric test but not a nonparametric test is probably not worth writing home about [2].
 
A similar lesson applies, I think, to first dates. When you’re attracted to someone, you overanalyze everything you say, spend extra time trying to look attractive, etc. But if your mutual attraction is really statistically significant and large enough to matter, it should emerge regardless of the exact circumstances of a single evening. If the shirt you wear can fundamentally alter whether someone is attracted to you, you probably shouldn’t be life partners.
 

A statistician argues you shouldn't be nervous on a first date. This sounds like math for “if it's meant to be, it will happen.”
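
To make the statistician's point concrete, here's a minimal sketch using made-up scores for two hypothetical groups. It runs a parametric check (Welch's t statistic) and a nonparametric one (an exact permutation test on the difference in means) on the same data. The numbers and function names are invented for illustration; nothing here comes from the quoted post.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_p(a, b):
    """Exact two-sided permutation p-value for the difference in means."""
    pooled = a + b
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    extreme = count = 0
    # Enumerate every way of relabelling len(a) of the pooled scores as "group a".
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / len(a) - (total - sum_a) / len(b))
        count += 1
        if diff >= observed - 1e-12:  # small tolerance for float comparison
            extreme += 1
    return extreme / count


# Hypothetical scores with a large gap between the groups.
group_a = [112, 108, 115, 110, 118, 109, 113, 116]
group_b = [98, 102, 95, 101, 97, 104, 99, 100]

t = welch_t(group_a, group_b)
p = permutation_p(group_a, group_b)
print(f"Welch t = {t:.2f}, permutation p = {p:.5f}")
```

With a gap this large, both approaches point the same way. Shrink the gap or the sample and they can start to disagree, which is exactly the situation the quote says not to write home about.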

Chinese robber fallacy

Given the recent discussion of media bias here, I wanted to bring up Alyssa Vance’s “Chinese robber fallacy”, which she describes as:
 
..where you use a generic problem to attack a specific person or group, even though other groups have the problem just as much (or even more so)
 
For example, if you don’t like Chinese people, you can find some story of a Chinese person robbing someone, and claim that means there’s a big social problem with Chinese people being robbers.
 
I originally didn’t find this too interesting. It sounds like the same idea as plain old stereotyping, something we think about often and are carefully warned to avoid.
 
But after re-reading the post, I think the argument is more complex. There are over a billion Chinese people. If even one in a thousand is a robber, you can provide one million examples of Chinese robbers to appease the doubters. Most people think of stereotyping as “Here’s one example I heard of where the out-group does something bad,” and then you correct it with “But we can’t generalize about an entire group just from one example!” It’s less obvious that you may be able to provide literally one million examples of your false stereotype and still have it be a false stereotype. If you spend twelve hours a day on the task and can describe one crime every ten seconds, you can spend four months doing nothing but providing examples of burglarous Chinese – and still have absolutely no point.
 
If we’re really concerned about media bias, we need to think about Chinese Robber Fallacy as one of the media’s strongest weapons. There are lots of people – 300 million in America alone. No matter what point the media wants to make, there will be hundreds of salient examples. No matter how low-probability their outcome of interest is, they will never have to stop covering it if they don’t want to.
 

A fantastic and important post by Scott Alexander of the great Slate Star Codex: Cardiologists and Chinese Robbers.
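
The arithmetic behind the fallacy is easy to sketch. The only figures taken from the post are the billion-person population and the one-in-a-thousand rate; the comparison group and the function name are my own hypotheticals.

```python
def available_anecdotes(population: int, rate: float) -> int:
    """Number of true examples a motivated writer could dig up."""
    return round(population * rate)


# Figures from the post: over a billion people, one in a thousand.
chinese_examples = available_anecdotes(1_000_000_000, 0.001)

# Hypothetical comparison group with the SAME rate but a smaller population.
other_examples = available_anecdotes(10_000_000, 0.001)

print(chinese_examples)  # 1000000
print(other_examples)    # 10000
```

Both groups have an identical rate, yet one offers a hundred times as many stories. The count of available anecdotes tracks the size of the population you're mining, not how common the behavior is.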

This is why I'm so suspicious of anecdote-based journalism, especially when it comes from an outlet with a hallowed reputation. Think back to the piece on Amazon working conditions in the NYTimes, and see how much actual data backs up some of the generalizations made in the piece. I'm not saying that the individual stories of terrible managers don't matter, because each of those in and of themselves was terrible and worth deep investigation.

Many people I know just take it for granted that it's like that throughout the company, though. Take this op-ed from Joe Nocera. He felt comfortable enough, after reading that piece, to make sweeping statements like this:

It’s an enormously adversarial place. Employees who face difficult life moments, such as dealing with a serious illness, are offered not empathy and time off but rebukes that they are not focused enough on work. A normal workweek is 80 to 85 hours, in an unrelenting pressure-cooker atmosphere.
 

I will bet Joe Nocera his net worth that the average workweek at Amazon is not 80 to 85 hours. I don't think any company in the world with over 170,000 employees has an average work week approaching anywhere near 80 to 85 hours. But hey, it's just a NYTimes op-ed, let's just throw a crazy fact like that out there with no sourcing whatsoever, who's going to fact-check an op-ed anyhow?

What 170,000 employees and who knows how many former employees provide a reporter is a lot of people to mine for Chinese robbers.

[Incidentally, that large a sample should also provide plenty of counter-examples, but Amazon's restrictive, and in my opinion, short-sighted social media policy prevents folks like that from speaking out. One employee couldn't take the piece lying down and wrote a rebuttal on LinkedIn, and later other former employees came out in the company's defense, including one who felt her story was used in the piece in a misleading way. It doesn't have to work just in the company's favor; other stories like this one have come out and added to some of the terrible anecdotes in the original NYTimes piece. However, since the social media policy restricts current employees from speaking out, it likely mutes the largest population of people who enjoy working there.]

I don't mean to wade back into the Amazon debate with this piece, and parts of it, even if rhetorically framed with bias, struck me as reasonably accurate. It just happens to be the most prominent recent example of Chinese robber fallacy that came to mind. Anyone who's been the subject of an anecdote-based journalistic piece should be suspicious of such pieces, yet so many people in and outside of tech took the Amazon piece as gospel.

The fact is, the Chinese robber fallacy really works. It must be so satisfying, as a reporter, to come across a source willing to go on the record with a dramatic narrative, even if it isn't statistically significant. That source also has spent their life looking for narrative patterns, and soon it's Chinese robbers all the way down.

Humans are wired to respond to narratives, to draw conclusions based on insufficient data. We're all looking for narrative shortcuts to the truth. When reporters give us a few carefully chosen examples, it's game over, regardless of whether or not it's a statistically significant sample, or whether or not the sample was plagued by selection bias.

Such journalism can be moving and hugely important. It can move people's hearts, and that's often what's needed to change the world. But it's also a dangerous weapon. Recall Janet Malcolm's opening line to her classic piece “The Journalist and the Murderer”:

Every journalist who is not too stupid or full of himself to notice what is going on knows that what he does is morally indefensible.
 

She meant it in a different context, but it echoes here.

Journalism with lots of data and statistics isn't sexy. It may not even require as much legwork as interviewing lots of people over a long period of time, and it's not the type of journalism that gets dramatized in the movies. But there's a reason that science isn't based on a few good stories.

Why the world is getting weirder

It used to be that airliners broke up in the sky because of small cracks in the window frames. So we fixed that. It used to be that aircraft crashed because of outward opening doors. So we fixed that. Aircraft used to fall out of the sky from urine corrosion, so we fixed that with encapsulated plastic lavatories. The list goes on and on. And we fixed them all.

So what are we left with?

As we find more rules to fix more things we are encountering tail events. We fixed all the main reasons aircraft crash a long time ago. Sometimes a long, long time ago. So, we are left with the less and less probable events.

We invented the checklist. That alone probably fixed 80% of fatalities in aircraft. We’ve been hammering away at the remaining 20% for 50 years or so by creating more and more rules.

We’ve reached the end of the useful life of that strategy and have hit severely diminishing returns. As illustration, we created rules to make sure people can’t get in to cockpits to kill the pilots and fly the plane in to buildings. That looked like a good rule. But, it’s created the downside that pilots can now lock out their colleagues and fly it in to a mountain instead.

From a great piece by Steve Coast on why the world is getting weirder. Follow the Pareto Principle long enough and you fix all the low-hanging fruit with a whole bunch of rules, leaving just the black swans unaccounted for.

Anyone who has worked on any tech product or service long enough, through many cycles, knows you can end up working on just edge cases. Often, when you hit this point, you're listening to a sliver of power users and are at the point of such diminishing returns that accommodating them might be counterproductive as a whole. All you do by adding that random feature they want is add some interface overhead and friction for the majority of your users, whose problems you already solved.

At this point, if the user base is large and healthy enough, most smart and ambitious companies move on to launching new products and services with higher marginal returns on their resources [3]. The resource vacuum is often exacerbated by the fact that the most ambitious employees would rather work on the new new thing. So they move on to the latest hot top secret project, leaving the former product or service in a maintenance mode, with minimal oversight.

  [3] Large and successful multi-product companies that reach this point often just kill off the product or service if the user base isn't large enough. Think Google Reader or Apple's Ping. A startup that reaches that point often pivots, sells themselves, or folds.

It's usually the right near-term economic thing to do, but it can also leave some widely used products or services with chronic issues or imperfections that puzzle users and outsiders. How, they wonder, can a company with thousands of employees not bother to fix such longstanding and seemingly trivial issues? This is why competition is healthy, even if sometimes it seems like we have too many redundant products/services in tech.

Coast's post also includes some good career advice.

On a personal level we should probably work in areas where there are few rules.

To paraphrase Peter Thiel, new technology is probably so fertile and productive simply because there are so few rules. It’s essentially illegal for you to build anything physical these days from a toothbrush (FDA regulates that) to a skyscraper, but there’s zero restriction on creating a website. Hence, that’s where all the value is today.

If we can measure economic value as a function of transactional volume (the velocity of money for example), which appears reasonable, then fewer rules will mean more volume, which means better economics for everyone.

Bayes's Theorem

This is from 2012 but is still a great overview of Bayes's Theorem, which really doesn't age.

Bayes’s theorem wasn’t actually formulated by Thomas Bayes. Instead it was developed by the French mathematician and astronomer Pierre-Simon Laplace. 

Laplace believed in scientific determinism — given the location of every particle in the universe and enough computing power we could predict the universe perfectly. However it was the disconnect between the perfection of nature and our human imperfections in measuring and understanding it that led to Laplace’s involvement in a theory based on probabilism.

Laplace was frustrated at the time by astronomical observations that appeared to show anomalies in the orbits of Jupiter and Saturn — they seemed to predict that Jupiter would crash into the sun while Saturn would drift off into outer space. These predictions were, of course, quite wrong and Laplace devoted much of his life to developing much more accurate measurements of these planets’ orbits. The improvements that Laplace made relied on probabilistic inferences in lieu of exacting measurements, since instruments like the telescope were still very crude at the time. Laplace came to view probability as a waypoint between ignorance and knowledge. It seemed obvious to him that a more thorough understanding of probability was essential to scientific progress.

The Bayesian approach to probability is simple: take the odds of something happening, and adjust for new information. This, of course, is most useful in the cases where you have strong prior knowledge. If your initial probability is off the Bayesian approach is much less helpful.

Includes a link to Eliezer Yudkowsky's intuitive explanation of the theorem and this Quora response to the question “What does it mean when a girl smiles at you every time she sees you?” which are both excellent.

A Bayesian approach to life is a sensible one, but the human mind isn't optimized to apply the theory accurately except at the broadest of levels (most people's intuition is way off when it comes to the mammogram example used in both the overview and the Yudkowsky piece linked above). This can be particularly problematic when it comes to our judgments of other people; we overweight new information without considering the prior odds. This is exacerbated by the internet, where we are prone to judge others on the select few pieces of content they choose to post for public consumption.
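
For the mammogram example, the arithmetic fits in a few lines. The figures below (1% prevalence, 80% true positive rate, 9.6% false positive rate) are the ones I recall from Yudkowsky's essay, so treat them as assumptions; the function and parameter names are mine.

```python
def posterior(prior: float, p_pos_given_true: float, p_pos_given_false: float) -> float:
    """Bayes's theorem: P(condition | positive test)."""
    true_pos = prior * p_pos_given_true          # has the condition and tests positive
    false_pos = (1 - prior) * p_pos_given_false  # doesn't have it but tests positive
    return true_pos / (true_pos + false_pos)


# Assumed figures: 1% prevalence, 80% sensitivity, 9.6% false positive rate.
p = posterior(prior=0.01, p_pos_given_true=0.80, p_pos_given_false=0.096)
print(f"{p:.1%}")  # 7.8%
```

Most people guess something far higher, which is what neglecting the prior looks like in practice: the 1% base rate swamps the test's apparent accuracy.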

Give kickers the boot

Benjamin Morris notes that the consistent improvement in NFL placekicker accuracy across the years means we need to update our fourth down strategy cards.

If you’re reading this site, there’s a good chance you scream at your television a lot when coaches sheepishly kick or punt instead of going for it on fourth down. This is particularly true in the “dead zone” between roughly the 25- and 40-yard lines, where punts accomplish little and field goals are supposedly too long to be good gambles.

I’ve been a card-carrying member of Team Go-For-It since the ’90s. And we were right, back then. With ’90s-quality kickers, settling for field goals in the dead zone was practically criminal. As of 10 years ago — around when these should-we-go-for-it models rose to prominence — we were still right. But a lot has changed in 10 years. Field-goal kicking is now good enough that many previous calculations are outdated.

...

But more importantly, these breakdowns allow us to essentially recalculate the bot’s recommendations given a different set of assumptions. And the improvement in kicking dramatically changes the calculus of whether to go for it on fourth down in the dead zone. The following table compares “Go or No” charts from the 4th Down Bot as it stands right now, versus how it would look with projected 2015 kickers:

My problem with field goal kicking is that it's boring. It's nothing at all like the rest of football. I dislike any sport which suddenly morphs into something else entirely, something worse, near the end of the contest, when things should be at their most tense and dramatic.

In basketball, a fluid, fast-paced game often ends with one foul after the other, forcing 9 world-class athletes to stand around while one guy shoots free throws. In football, if teams aren't just running the clock out or kneeling down at the end of the game, they're often lining up for a field goal, a specialized craft that has nothing to do with running, throwing, or catching the football. It's as if a tennis match that went to a tiebreak were settled by having the two players go to the sideline, replaced by two random people coming in to settle matters by playing Cornhole. I'd just as soon do away with field goal kicking in football and have teams go for it on fourth down all the time.

This is one advantage for baseball. To finish off the game, you have to get batters out just like you had to for the previous innings in the game.

Control

Really great piece at Vox on how you can over-control tests to the point where the thing you're trying to detect is controlled away in a misleading way. 

Statistical controls are great! Except when they're not.

The problem with controls is that it's often hard to tell the difference between a variable that's obscuring the thing you're studying and a variable that is the thing you're studying. 

An example is research around the gender wage gap, which tries to control for so many things that it ends up controlling for the thing it's trying to measure. As my colleague Matt Yglesias wrote:

The commonly cited statistic that American women suffer from a 23 percent wage gap through which they make just 77 cents for every dollar a man earns is much too simplistic. On the other hand, the frequently heard conservative counterargument that we should subject this raw wage gap to a massive list of statistical controls until it nearly vanishes is an enormous oversimplification in the opposite direction. After all, for many purposes gender is itself a standard demographic control to add to studies — and when you control for gender the wage gap disappears entirely!

"The question to ask about the various statistical controls that can be applied to shrink the gender gap is what are they actually telling us," he continued. "The answer, I think, is that it's telling how the wage gap works."

It's a difficult chicken and egg problem, very relevant to studies of racism in police enforcement.

Imagine applying these controls to society itself. We still have race, but people of all races have the same amount of money, and they live in the same kinds of neighborhoods, and they do the same kinds of drugs, and they even drive the same kinds of cars. That society would be a lot less racist. But part of the reason we're so far from that society is racism. Discrimination perpetuates itself.

In some ways, what's amazing about many of these studies is that they show a racial effect even after controlling for so much of racism's work. They show that racism exists even in our control society — the one with equality of income, and education, and neighborhood, and car choices. The one where we've wiped out most every difference but pigment. The one where we've left ourselves no excuses for our prejudice. It is remarkable how much discrimination can survive.

Read through Harold Pollack's emailed thoughts at the bottom of the piece.

Multiple testing

One of the potential pitfalls that arises now that it's easier and easier to test hundreds of variables to try to find correlations is the problem of multiple comparisons or multiple testing.

The term "comparisons" in multiple comparisons typically refers to comparisons of two groups, such as a treatment group and a control group. "Multiple comparisons" arise when a statistical analysis encompasses a number of formal comparisons, with the presumption that attention will focus on the strongest differences among all comparisons that are made. Failure to compensate for multiple comparisons can have important real-world consequences, as illustrated by the following examples.

  • Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone.
  • Suppose we consider the efficacy of a drug in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes more likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom.
  • Suppose we consider the safety of a drug in terms of the occurrences of different types of side effects. As more types of side effects are considered, it becomes more likely that the new drug will appear to be less safe than existing drugs in terms of at least one side effect.

In all three examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison.

For example, if one test is performed at the 5% level, there is only a 5% chance of incorrectly rejecting the null hypothesis if the null hypothesis is true. However, for 100 tests where all null hypotheses are true, the expected number of incorrect rejections is 5. If the tests are independent, the probability of at least one incorrect rejection is 99.4%. These errors are called false positives or Type I errors.
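
The figures in that last paragraph are easy to reproduce. A minimal sketch, assuming independent tests with every null hypothesis true, as the passage does; the function names are my own.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of incorrect rejections when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


expected = expected_false_positives(100, 0.05)
at_least_one = prob_at_least_one(100, 0.05)

print(f"{expected:.0f}")      # 5
print(f"{at_least_one:.1%}")  # 99.4%
```

Even a modest 14 tests at the 5% level puts the chance of at least one spurious "finding" above half, since 1 - 0.95**14 is roughly 0.51.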

A recent NBER paper argues that this problem invalidates most finance papers claiming to have found some formula for investing success. The abstract:

Hundreds of papers and hundreds of factors attempt to explain the cross-section of expected returns. Given this extensive data mining, it does not make any economic or statistical sense to use the usual significance criteria for a newly discovered factor, e.g., a t-ratio greater than 2.0. However, what hurdle should be used for current research? Our paper introduces a multiple testing framework and provides a time series of historical significance cutoffs from the first empirical tests in 1967 to today. Our new method allows for correlation among the tests as well as missing data. We also project forward 20 years assuming the rate of factor production remains similar to the experience of the last few years. The estimation of our model suggests that a newly discovered factor needs to clear a much higher hurdle, with a t-ratio greater than 3.0. Echoing a recent disturbing conclusion in the medical literature, we argue that most claimed research findings in financial economics are likely false.

Gaze deeply enough into the noise and you'll see some pattern.

[via Vox]

RELATED: Spurious correlations