Social Science Replicability Challenge

Data is a universal language we’re all beginning to understand and use, but if the data is faulty, we aren’t really learning anything at all.

A great deal of what we know today comes from those who took the time to research, document and share their findings with the world. We’ve learned a lot through social science experiments, and when we are caught between a rock and a hard place, we often reach for a statistic or think back to something we’ve read to help us make a decision. This way of thinking does not stop with personal decision making; societies and the institutions within them also trust the foundation of research and the scientific process. Our school systems rely heavily on social science findings to determine the most effective ways to teach children and the best practices for helping them retain information. If these children go on to university, perhaps they will carry on the legacy of research papers and experiments, repeating the cycle of knowledge. But what if a great deal of the information we’ve always assumed to be true had flaws? What if the research did not measure up to a greater reality? What if a lot of what we think we know today, in the world of social science, isn’t all that accurate?


Unlike the arts, which encourage us to create something new and different each time, science is best proved through repetition. If we can repeat an experiment and reach the same conclusion each time, we can have confidence that the finding is true. Table 1, published in 2015 by the Open Science Collaboration, a project led by University of Virginia psychologist Brian Nosek, shows that only about one third of psychology experiments from premier journals could be replicated (Open Science Collaboration, 2015) [1]. Although Psychological Science had the highest number, 53% is still an extremely low success rate.

Table 1: The Reproducibility of Psychological Science

To understand why this happens over and over, we must ask ourselves how the data is being collected in the first place. Traditionally, researchers test how something affects a small number of people and then conclude that the result holds for the entire population. Much of the time, a result collected this way ends up being some kind of fluke. An example of this can be found in the famous “Marshmallow experiment”.
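To see why small samples so often produce flukes, here is a minimal simulation sketch. It is not drawn from any of the studies cited here: the group sizes, the 0.5 standard deviation threshold, and the function name share_of_flukes are all assumptions chosen for illustration. Both groups are drawn from the same population, so any gap between them is pure chance.

```python
import random
import statistics


def share_of_flukes(sample_size, trials=1000, threshold=0.5, seed=0):
    """Fraction of simulated studies whose two groups differ by more than
    `threshold` standard deviations even though no real effect exists."""
    rng = random.Random(seed)
    flukes = 0
    for _ in range(trials):
        # Both groups come from the same population (mean 0, sd 1),
        # so any difference between their averages is sampling noise.
        treated = [rng.gauss(0, 1) for _ in range(sample_size)]
        control = [rng.gauss(0, 1) for _ in range(sample_size)]
        gap = abs(statistics.mean(treated) - statistics.mean(control))
        if gap > threshold:
            flukes += 1
    return flukes / trials


# Hypothetical study sizes: 10 people per group versus 500 per group.
small = share_of_flukes(sample_size=10)
large = share_of_flukes(sample_size=500)
print(f"small studies showing a big gap by chance: {small:.1%}")
print(f"large studies showing a big gap by chance: {large:.1%}")
```

Under these assumptions, the small studies cross the threshold far more often than the large ones, even though there is nothing real to find in either case.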

The Stanford Marshmallow experiment was a series of studies, run from the 1960s to the early 1970s, that tested children on delayed gratification (Mischel, Walter; Ebbesen, Ebbe B.; Raskoff Zeiss, Antonette (1972). "Cognitive and attentional mechanisms in delay of gratification". Journal of Personality and Social Psychology) [2]. The idea was to offer each child a small prize (like a marshmallow) and promise an even bigger prize (cookies, another marshmallow, etc.) if they could go approximately 15 minutes without eating it. The assumption behind the experiment was that the children who waited had a higher level of intelligence and would go on to earn higher SAT scores and achieve greater overall success. In the 1990s, follow-up evidence appeared to back this up: the children who had delayed gratification did receive higher SAT scores. Over the years, research papers cited the experiment, schools placed a strong priority on teaching children to wait for their reward, and even parenting books referenced it.

In school, I can remember learning about the experiment and being told that the smart children who could wait had a higher emotional intelligence. So it came as a shock in 2012 when I read that the study was being revisited ("Marshmallow Test Revisited". University of Rochester. October 11, 2012) [3]. The original study did not distinguish strategic waiting from the ability to wait. If the children did not trust the person making the promise, based on prior experience, wouldn’t that make the experiment completely different? In 2018, another follow-up study was conducted. This time, once the analysis took into account factors like family and home environment, the link between delayed gratification and later academic achievement shrank dramatically (Watts, Tyler W.; Duncan, Greg J.; Quan, Haonan (2018). "Revisiting the Marshmallow Test") [4].

You might be wondering how some of the most educated experts in a field get the numbers so wrong. Recall that only about one third of the studies the Open Science Collaboration examined could be replicated, but how does this happen? In the case of the Marshmallow experiment, the researchers used the resources most available to them. The children in the experiment were attending Stanford’s on-campus nursery school, meaning their parents were more than likely Stanford faculty and students. Well, this changes everything. The children already had an advantage in obtaining higher SAT scores because their parents were themselves in academia. In the 2018 study, ten times more children were tested, most of whom had parents who did not attend college.
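The role of family background can be illustrated with a toy simulation. This is only a sketch under invented assumptions, not a reconstruction of either study: the wait times, scores, group split, and the names simulate_children and within_group are all hypothetical. In this made-up world, background raises both how long a child waits and how well they score later, while waiting itself has no effect on the score at all.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_children(n=5000, seed=1):
    """Return (advantaged, wait_minutes, score) rows for hypothetical children."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        advantaged = rng.random() < 0.5  # hypothetical family-background flag
        # Background shifts both numbers; wait time never feeds into score.
        wait = rng.gauss(10 if advantaged else 5, 2)
        score = rng.gauss(600 if advantaged else 500, 50)
        rows.append((advantaged, wait, score))
    return rows


def within_group(rows, flag):
    """Correlation between wait and score among children sharing a background."""
    waits = [w for a, w, _ in rows if a == flag]
    scores = [s for a, _, s in rows if a == flag]
    return pearson(waits, scores)


rows = simulate_children()
overall = pearson([w for _, w, _ in rows], [s for _, _, s in rows])
print(f"all children together: r = {overall:.2f}")
print(f"advantaged only:       r = {within_group(rows, True):.2f}")
print(f"not advantaged only:   r = {within_group(rows, False):.2f}")
```

Across all the simulated children, waiting and scores look clearly related, but the relationship all but vanishes when children from similar backgrounds are compared with each other. That is the kind of pattern an analysis that accounts for home environment is designed to catch.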

If you’ve experienced the life of a researcher, or you’ve been a student listening to your professors complain about their newest paper, you may have noticed a significant factor in the field: money is tight and only given when the researcher has earned it. It is not uncommon for anyone conducting experiments to use whatever is readily available to them. College students have traditionally been asked to participate in research studies for extra credit, which is a smart way for researchers to run experiments without spending money. This well-known practice brings us to another flaw in social science data, known as WEIRD. The acronym stands for Western, Educated, Industrialized, Rich, Democratic, which describes the majority of participants in most studies. A 2008 meta-analysis of thousands of articles published over 20 years found that around 95% of behavioral science research subjects come from the US, EU, and English-speaking countries [5], which together account for only 12% of the world’s population. The same study found that 68% of the research subjects are from the US and two thirds are college students (in other words, WEIRD).

We must ask ourselves: if the research only represents about a tenth of the population, how can we expect to get the proper results? In research published by Cambridge University Press, the authors noted that such samples are not an accurate representation because people who live in other parts of the world display different “moral decision making, reasoning style, fairness, even things like visual perception” [6]. These are all traits we develop based on the nature of our upbringing and environment. Results drawn from such samples become overgeneralized and inaccurate. Out of 100 psychology studies published in 2008, fewer than half could be replicated [7]. WEIRD plays a large role in this, because changing the subjects in a study can drastically change the results.

Now, let’s say we’ve swerved through the replication crisis successfully and conducted an experiment. It’s time to publish a paper and present all our findings. We must first send it for peer review before anything can get the green light for publishing. The peer review process may seem wonderfully reliable. However, as summarized in an Economist article, the statistical nature of scientific experiments may result in around 36% of positive results being false positives that are accepted as correct [8]. This weakness has left room for bad data to slide through. William Tierney, M.D., a Regenstrief Institute investigator and study co-author, says, “Published research is becoming a more and more significant factor in scientific dialogue. Physicians and other researchers are no longer the only readers of medical studies. Patients and their families and friends now regularly access medical literature. This makes the review process even more important” [9]. Data is a universal language we’re all beginning to understand and use, but if the data is faulty, we aren’t really learning anything at all. Therefore, it is extremely important to find alternative ways to access clean, unbiased, and ethical data.
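A figure like the 36% cited above can come out of simple arithmetic. The sketch below uses illustrative inputs that are assumptions rather than numbers stated in this essay: 1,000 hypotheses tested, 10% of them actually true, an 80% chance of detecting a true effect (statistical power), and a 5% chance of a false alarm (the usual significance threshold). The function name is hypothetical too.

```python
def false_share_of_positives(n_hypotheses, share_true, power, alpha):
    """Share of all positive results that are false positives."""
    true_hypotheses = n_hypotheses * share_true
    false_hypotheses = n_hypotheses - true_hypotheses
    # Real effects that the experiments manage to detect.
    true_positives = true_hypotheses * power
    # Nonexistent effects that pass the significance test by chance.
    false_positives = false_hypotheses * alpha
    return false_positives / (true_positives + false_positives)


# Assumed inputs: 1,000 hypotheses, 10% true, 80% power, 5% false-alarm rate.
share = false_share_of_positives(1000, 0.10, 0.80, 0.05)
print(f"{share:.0%} of positive results are false")
```

With these assumed inputs, 80 true effects are detected and 45 false alarms slip through, so 45 of the 125 positive results (36%) are wrong. Changing any input changes the answer, which is why the share of false positives depends so heavily on how many of the hypotheses being tested are true to begin with.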

With researchers catching on to the reality of the replicability crisis and facing so many bad samples, there have been instances where, out of desperation, data is stolen. A few years ago, Google and the University of Chicago Medicine teamed up for a great cause: they wanted to improve prediction in health care. The project relied heavily on machine learning, and because of this the University of Chicago ended up sharing a lot of patient data [10]. This left both the university and Google facing a lawsuit. By law, medical practitioners may share their data if the identities of the patients are not disclosed. In this lawsuit, the plaintiff argued that because the records had time stamps attached, Google could use other data in its databases to identify each individual medical record. Although the research was for the greater good and came from two very respected sources, the way the data was obtained was considered unethical. This is what we have been left with in the research community. We want answers, yet we don’t know how to properly get them. This leaves us with the dilemma of corrupted data or unethical data.

The urgency to find a way to collect good data for research papers is at an all-time high. When we have the wrong data, it gets communicated and turned into bad policy, just like what happened with the Marshmallow experiment. If we want to ensure the best for our communities, we must get the proper answers to lead us to a much better life.