Why you should share your data

What I learned from asking people about their data, or talking about data sharing

Jan 11, 2024

In the past 4 years I have asked a number of people to share their data with me. Mostly to check the accuracy of their claims. But several times it was to further our own research on similar topics and to perform our own additional analysis. To be clear I am talking about data underlying publications, either in traditional journals or on arXiv. I have also spoken about sharing data broadly, in fact I do so in every one of my public talks. The conversations about data rarely surprise me, in fact they follow one of very few (3-4) trajectories. So I decided to write up a summary of the arguments I use for each type of response I get.

I am a condensed matter physicist, and the first thing I can tell you is that a majority of people in my community, especially the students and postdocs, understand the importance of sharing data. At the same time, there is a fraction of us whose attitude towards data sharing is … uninitiated. More often than I like I find myself in a situation where I broach the topic with a person who claims they are being asked about this for the first time in their adult life. My impression from other academic communities, not too far afield - for instance astrophysics - is that they share their data much more readily, and even foster a culture of doing so. There is nothing unique about my physics community, and there should be nothing culturally exceptional in how they react to data requests. But because the subject is just something that we do not promote a discussion about, people come up with fairly naive arguments for why they are hesitant to share their data. I imagine people had many silly arguments in the past about why they should not use soap or brush their teeth.

My main argument for why you should share your data is that without data available, your report, paper or claim you make in a presentation have no real value. A paper is just an illustrated essay, and a talk you give is a performance. Your actual primary product is your data, and any analysis steps you undertook starting with your data. This is what the secondary products like publications draw their worth from. By sharing data, you allow for verifiability, for un-skewing of your conclusions, for further analysis and reuse.

Why you should not share

That being said, there is one really good reason for you not to share data. It is if your data have never been there, if you made it up, if you manipulated it, cherry picked your data in extreme ways, if you don’t have enough data to justify your claims, if you know of major errors in your analysis and are not willing to address them. If any of these are the case, then indeed - you are better off resisting requests to see your data. The sad situation is that at the moment, there are no really good mechanisms to compel you to release your data. This may change in the near future, with new government agency regulation coming in effect. Who knows, perhaps more publishers will follow the example of PNAS and start retracting papers for which data requests are denied.

Excerpt from PNAS policy

There are also cultural changes happening. Already now, if it becomes known that you are not willing to share your data, you can bet that many people will think that you have something to hide, and that your results are not reliable in some way. I certainly do. So the tradeoff when you refuse to share data will be: you may stay out of tangible trouble, such as retractions and research integrity investigations, but you will also build up suspicion around your work. Keep in mind that refusal to share data will increasingly become its own research integrity concern, and it already is in some countries, such as the Netherlands and Denmark.

“They will twist my results and make them look bad”

One kind of reaction to a request to share data is that the original author does not trust you. They claim you are out to get them, catch them out on some inconsistency that is either made out of thin air, or has a simple explanation. No work is perfect and there is always something to nitpick. The idea is that people ask you for your data to use it in an unjust and personal attack against you or your colleagues.

I cannot say this never happens, but certainly the frequency at which people claim injustice towards themselves is too high. More often than not, the data request is a form of scientific challenge that the receiver chooses to interpret as a personal attack - perhaps assisted by the request not being sufficiently polite. Sometimes, the data are in fact bad and the author is worried that this will be exposed. Other times the author is caught off guard by an unusual request that they never had to deal with due to others mostly trusting their claims. Their reaction is to suspect that you have a personal agenda.

But even if you are concerned that you are under attack from an actual troll that is out to get you in the most unfair way possible, you should still share your data. If your work is solid, then the data will speak for itself. It is very hard to make up an untrue critique. If someone writes an untrue comment criticizing your work, you will be able to explain what they got wrong. If they find a minor error, you should be happy and grateful. If they find a major problem, you should be even more grateful, though it is also fine to feel somewhat embarrassed.

Besides, the current system has layers upon layers of protection for the claim makers, and obstacle after obstacle for people challenging somebody else’s work. In my opinion the system is off balance and should be modernized. But the way things stand it is very hard to take down a published study even if the takedown is justified, and impossible to do so if the study is valid. There are simply no examples of that happening in our field, and certainly not enough to justify withholding data that can validate published claims.

“Other people published similar claims, so no need to see our data”

This is just silly. Where to begin? If your work is reliable and reproducible, and this has already been established extensively by others, then what is your worry about sharing your data? It will only make your results more verifiable, more useful. If people are already following up on your work in their own labs, then giving them more information will only enhance the impact of your own findings.

On the other hand, there is a well-known phenomenon called ‘confirmation bias’. This is when you make a claim, and others take it as a cue that this is what their results should look like. Sometimes they already had similar looking data, but did not think much of it. Yet after reading your paper they decide that they can make their own claim that mirrors yours. In one example, the authors found some “quantized plateaus” following a paper that was later retracted. There are also examples of how people jumped the gun, and claimed a discovery that was physically plausible, but that was only made for real later.

More generally, if one group reaches a finding following another group's lead, it does not mean that the first group was correct, or completely correct. Experiments are never identical. Sometimes different techniques are used to explore similar questions, different analysis is performed, different materials are used. So a confirmation is rarely very clean. A variation on this theme comes when your own work reproduces your earlier work, and the data request is about the first one. If you have not shared data from either study, then both are equally unconfirmable. Each of your works just needs to be able to stand on its own, and not rely on subsequent works by yourself or others.

“Our data are not presentable”

Some react to data requests with mild shame because they feel that their data is not organized well enough for others to see it. Or perhaps it is in an archaic format, not properly described. A related issue is if they were not keeping track of their data, and are not sure that it is still there. I even heard an argument that data could not be shared to protect student privacy, because they are making personal notes in the lab journals, writing down things that reveal gaps in their understanding. You imagine others looking at your materials and judging you, like people often do with other people’s code.

To that I say: share it anyway! I agree we should all put more effort into making data “findable, accessible, fairable, …”. But the truth is, most of our data is not organized well enough. So we are all in the same boat. Despite this, I direct my group members to share all their data not waiting for requests, but at the time of arXiv publication. I do plan to develop better data organizing standards, but in the meantime, I am fairly certain that our data is not particularly embarrassing to share, because it is typical. There is a much greater benefit to making it public than if I were hiding it until we find time to reorganize it. In fact, the one repository we were trying to arrange very neatly remains unpublished for several years… If someone gets interested in our data, we will work with them to help them read it, plot it and answer any of their questions. This already happened a couple of times with the data we published.

“I don’t know what to share, there is too much data!”

The first thing I would ask you back is - how much is too much? If it is less than 50 Gb per experiment, then it will all fit into one single Zenodo record, already now. And the file limits will likely grow in the future. The days when arXiv would only give you a couple megabytes per record are long gone! You could simply share your entire data. And this is great because you do not need to curate it at all, so you spend zero time figuring out what you should share.

If you have substantially more data than tens of gigabytes per project, which could be the case with some synchrotron measurements, or numerical simulations - then you could share the code you used to process your data down to make your paper figures, along with some examples of the original data. See this article about how CERN does it, they have way too much data to even store it all, so they publish their process.

“No one will be able to follow all my data!”

Chances are, you and your collaborators already did some curation as you were sorting through the data, so you could add those summaries, writeups, powerpoint presentations to your record, and those will be an excellent map to how you did your work and what you found the most interesting. You surely have records of what experiments you ran that you took the data on, that you used to keep track. Share these as well.

“Some of my data is garbage anyway”

A related concern I hear is that some of the data is calibration data, “garbage data” that nobody should be looking at, and it can be simply ruled out as anything useful. Here you should be careful. If by garbage data you mean measurements with the cable unplugged, or a broken STM tip, I agree - no need to share it. But if these are measurements from a sample that appears to be alright, and they simply do not show the effect you were looking for, or it looks worse than in the sample you deem your best - then you cannot justify removing those data. They are very valuable for your peers to see. It would give them a proper impression of your full study, and help overcome the very common positive bias in your analysis. I say this as a person who spent years and directed students and postdocs to follow-up on the work of others which turned out to be a wild goose chase because the claims were based on a single working nanowire, and most of the experiments failed but that was never reported.

“I am still publishing more analysis on the same data”

This goes to the worry of being scooped if you share too many details of your work. Somebody else may even perform the analysis of your own data that you were planning on doing, ahead of you. Personally I am not bothered by this, and I would even be thrilled if it happened to our work. It is why I insist on sharing our data: I would like there to be more and better research on my topic and building on my group’s work.

But if you have definite plans to publish a series of papers using the same set of data, and only share data when that is done, I would reluctantly agree that this is reasonable. You may also have administrative restrictions on data sharing, such as if you are filing a patent, or it is impacting national security - but these are very rare situations. In general, if you already published a paper or filed for a patent, this means you can safely share data. As a basic principle, I think that full data underlying any paper should be available as soon as that paper is published, starting with the publication on arXiv. Other researchers will start interacting with your work from that moment, and they should not wait for you to complete other studies, which can take years.

“I will share upon a reasonable request”

“Data available upon reasonable request” is really a meaningless sentence, in my opinion. Perhaps you added it to your paper automatically, as one of the default and obligatory phrases at the end of your manuscript. In this case, I encourage you to spend a few minutes and define for yourself, what for you is a reasonable request? Or, go from the opposite - what would be an unreasonable request? When I did this, my answer was - there is no such thing as an unreasonable request. Because all of my data from the recent papers is shared, there is no burden involved in sharing it, only in guiding people through the repositories. If you think the requests can be burdensome, this could be a wake-up call for you to put some effort into locating and structuring some of your data. For instance, if you are a PI and you don’t know where your students and postdocs keep the data, you should get on top of this.

You may be concerned that a request is frivolous. Perhaps you imagine some data scavenging researchers that just go around asking everyone for their data like bots that scavenge the internet. I don’t know if they exist, but I have never met one.

If you did this exercise and can define what a reasonable data request is, I would love to hear it! As long as it is not “I will know if it is reasonable or not when I see it” because this gives you full power to reject any data sharing request. And instead, you should just share your data. So if you do add the “reasonable request” statement to your paper, try to think through what would you do if and when a request comes? Which data will you share? How will you do it? How long will it take you?

Sergey’s Substack

Why you should share your data

What I learned from asking people about their data, or talking about data sharing