Most researchers broadly support the idea behind open data—that public access to raw data would accelerate science by putting it in the hands (and minds) of many. Yet most are still reluctant to post their research results online. They cite lack of time, money and universally agreed upon standards, as well as technical barriers, as the main reasons they hold data back. Of course, there are psychological and cultural reasons, too: a sense of ownership over such a hard-won resource and a fear of scrutiny and of being “scooped.”
Neuroscience is no exception, and the barriers to the free flow of data are particularly acute in a field of such diversity and scope. But with a number of large-scale collaborative projects such as the U.S. Presidential Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative about to begin pumping out masses of new information, data sharing in brain research has become a pressing issue. Now, an initiative to produce a common language, or unified format, for neuroscience data has been launched by a handful of prominent neuroscience institutions, including The Allen Institute for Brain Science, the California Institute of Technology, the New York University School of Medicine, the Howard Hughes Medical Institute and the University of California, Berkeley. Named Neurodata Without Borders, it aims to begin to unlock neuroscience data by removing technical barriers.
Three of the researchers behind the initiative spoke to The Kavli Foundation about what a new era of open data could mean for neuroscience and how to get there.
- CHRISTOF KOCH - is the Chief Science Officer at the Allen Institute for Brain Science in Seattle, one of the partner organizations behind Neurodata Without Borders.
- KAREL SVOBODA - is the group leader at Janelia Farm Research Campus of the Howard Hughes Medical Institute in Ashburn, Virginia. Svoboda’s laboratory is contributing datasets to Neurodata Without Borders.
- JEFF TEETERS - is the project coordinator for Neurodata Without Borders and the developer and administrator of CRCNS.org at the University of California, Berkeley.
The following is an edited transcript of a roundtable discussion. The participants have been provided the opportunity to amend or edit their remarks.
THE KAVLI FOUNDATION: The Director of the Sanger Institute, Michael Stratton, has said that, “...data sharing will be central to extracting the maximum amount of knowledge for the benefit of humankind.” Is that the impetus for Neurodata Without Borders?
CHRISTOF KOCH: Yes. We need to understand each other’s data. There are about 10,000 neuroscience labs in the world. Right now, it’s a little bit like the Tower of Babel, where everyone speaks a different language. But if we want to be able to understand the brain, we need to be able to say, “I know what language your data is written in, I can read your data and manipulate it, and thereby understand the ways it differs from my data.” This way I can assess whether it is similar to mine and reinforces my conclusions, or is very different so we can understand where those big differences arise.
KAREL SVOBODA: We need to share data also because there is a paradigm shift happening in many fields of biology, including neuroscience. The data are becoming so overwhelming and multidimensional that any one laboratory cannot possibly analyze them to completion. So we have to make every effort to make high-value datasets available so that lots of other folks can mine the data to maximal effect.
JEFF TEETERS: The brain is so complex that there isn’t a good theory about how it works. As new theories are developed, they will need to be tested using data. And there might be analyses that people will want to perform on datasets that we don’t know about now. So it behooves us to preserve the results of the experiments we’re conducting today so that the data can be tested against new models of the brain in the future.
TKF: Do you think it should be an obligation for publicly funded researchers to make their data available?
KOCH: There should be as little obligation as possible, but I think data sharing should be warmly encouraged for the sake of open science.
SVOBODA: I completely agree with Christof. It should be encouraged. Although I think for some datasets, such as publicly funded atlasing projects, it should be obligatory. Granting agencies are heading that way but there are major infrastructural issues that have to be solved first. One of those is standardization of the data formats.
TKF: Why have you come together to tackle this problem now?
KOCH: One reason is there’s a growing realization that so much of science can’t be reproduced, and being able to compare each other’s data is going to critical to figuring out why that is. Two, U.S. government agencies have recognized the data-sharing problem and are beginning to think about how to address it at the national level. There’s a desire to do something because we are amassing big data about different brains. And three, particularly here at the Allen Institute, we want to release data from the brain observatories we’re developing. But first we need to decide what formats to follow so the publicly accessible data can be of maximal use to the worldwide community of scientists interested in the brain.
SVOBODA: There are also of course the Human Brain Project in Europe and the Obama BRAIN Initiative in the U.S. Those of us who have been planning these efforts, as well as the granting agencies, have identified data sharing as a serious issue. But no one knows exactly how to go about solving it. So one impetus for Neurodata Without Borders is to take a step ahead and tackle one major obstacle right now: the standardization of data formats.
Another reason is that what we consider scientific output, the published paper, is just a pale reflection of science itself. The era of presenting the results of neuroscience datasets in a couple of figures is basically over. Instead, the output of a study needs to be data that you can interact with.
TKF: Christof, can you tell me more about the Allen Institute’s brain observatories?
KOCH: We’re trying to build these large-scale behavioral observatories where we train mice, have them run, record their brain activity routinely, day in and day out—just like you would observe the universe at an astronomical observatory—and make this data publicly available to whoever wants it.
TKF: Has your own research ever been hampered by the inability to read older data from your own laboratory or data from someone else’s?
SVOBODA: Yeah. I’m now looking at a stack of notebooks with about 300 Post-It notes from a postdoc who left the laboratory just recently, plus a couple stacks of hard drives. Yes, I’m acutely aware of this problem and I wish to overcome it!
KOCH: Worrying about standards sounds like an unsexy activity, but once you have a standard then you can greatly accelerate progress. If you go back in history, standardization has often been the driver of key industrial innovations. In the early 19th century, people had to develop a standard for screws—–20 threads per quarter inch, that sort of thing—because you had a lot of different companies manufacturing different screws. These companies voluntarily came together and decided on a national standard. That really boosted industry. In modern times you see it in communication technologies and the Internet, where standardization has been absolutely essential as one of the drivers in progress.
SVOBODA: There have been successes in biology that are built on standards, too, for example, in genomics and certain fields of cell biology.
TKF: The current Neurodata Without Borders project is focused on “cellular neurophysiology” data, either optical or electrical recordings of brain activity at the cellular level. Why did you choose to focus on those?
SVOBODA: Those are the major ways to read out the kind of messages that hold information in the brain. Monitoring what individual neurons and even individual synapses are doing can only be done with cellular imaging or electrophysiology.
These two modalities of recording, optical and electrical, provide similar information but are based on different types of data. In one case, you have images as your raw data; in the other case, you have a time series of electrical signals from a bunch of neurons. The technical challenge is, how do you present them in the same format?
TKF: Are there lots of other types of data in the neurosciences that could similarly benefit from standardization and sharing?
SVOBODA: Absolutely. Looking at what neurons are doing in the brain in isolation—the cellular neurophysiology data that I just talked about—is interesting enough. But really, the full richness of those data only becomes apparent when they are aligned and compared with behavior: for example, what was the animal doing at the time of a particular neurophysiological event? How motivated was the animal? What was it learning? That all needs to be integrated into this data format because we need to look at how these neurons contribute to the behavior of the animal.
TKF: Neuroscience has been described as a field where everyone’s working in silos. Is that holding up the field in terms of addressing the data-sharing problem?
KOCH: It’s one of many problems impeding neuroscientific progress. We’re already dealing with the most complex system in the known universe: the brain. So let’s not make it even more difficult by talking different languages. If our data are written in the same alphabet using the same syntax then it becomes possible to read each other’s data much more easily. Then we can greatly speed up progress as has happened in other scientific communities, such as observational astronomy, climate research and genomics.
SVOBODA: What you often have in neuroscience is contradictory findings on ostensibly the same kind of question in the same brain area. I can think of several examples where you see, even in the same journal, opposing conclusions based on virtually the same experiment. If the data were available, a comparative reanalysis might be able to resolve these differences.
TEETERS: Another area of potential reward is unleashing all of the people that are making strides in machine learning, which is being put to use in Google’s self-driving car, and statistical analysis of so-called “big data.” If those techniques were applied to neuroscience data, maybe there would be things that could be learned that we just don’t know about yet. But right now the data is too inaccessible for people to do that.
TKF: Jeff, a number of online data repositories such as Figshare already exist. Why don’t they meet the field’s needs?
TEETERS: They’re limited in what they’re providing because they’re not striving to have a standard format for domain-specific data—in our case, neuroscience data. So people can go there and download the data, but then they have to spend time figuring out what the format is and searching for experiments that are relevant. For example, one dataset that we have at CRCNS.org—a data-sharing website for neuroscience and the repository for Neurodata Without Borders—is from Gyorgy Buzsaki’s group at New York University. It has about 7,000 neurons. To make it useful, we set up a relational database that allows people to search for data with particular properties. If the data are not organized well, you really can’t find what you need. Domain-specific repositories and standards are also needed to allow combining data from multiple experiments. This is very difficult if you just upload data without having some standards.
TKF: You’re hoping that other people will adopt these standards once you’ve set the example. That’s kind of like dangling a carrot in front of the research community. Do you think a carrot is enough or do you need a carrot-and-stick approach to solve the data-sharing problem?
KOCH: As Karel said, I think the funding agencies will soon probably get the stick out and force all of us to address this problem. But again, if you look at this history of standardization—for example for video formats like Betamax and VHS—it’s best if the community decides on a standard by consensus rather than having governments impose one. In neuroscience, if people want to make their data available to others, it is in their self-interest to partake in this standard. But first we have to make it convenient for people to use.
SVOBODA: I think we’re just a step ahead of something that’s coming down the pike. We want to make our best effort at selecting a format and hopefully seduce other folks into adopting our standards. The benefits are great. Right now, without standardization, the rule of thumb is that if you want to share a dataset with one collaborator, it’s at least a full week of effort to make the dataset understandable to the collaborator. With standardization this kind of overhead would go away. So we’re motivated to get this to work for our own parochial purposes. But I think if we lead by example, we’ll have deeper buy-in.
TKF: Are there certain challenges to sharing data internationally?
SVOBODA: No. It’s the same problem whether you are in the U.S., Europe, Japan or Timbuktu.
TKF: So after the first phase of Neurodata Without Borders, in which you aim to select a data format, adapt it yourselves and deposit those data to CRCNS.org, what comes next?
TEETERS: Having the data hosted at locations where there are supercomputer resources so that people don’t have to download large datasets but could go online to analyze them, is one thing that could come down.
SVOBODA: Absolutely. This goes to Christof’s point. It won’t make a difference if you’re in Timbuktu and the data sits in the “cloud” somewhere. These kinds of neuroscience datasets are so large that downloading them from a remote site is impractical. We have to figure out ways to deploy our analysis tools to act on data in the cloud.
KOCH: Once we have a format, one of the big things that we have to do is spread the message to the community. We’ll start off using the data format in our own labs. Then we have to try to encourage our friends and colleagues and students and postdocs to take the standard with them when they leave the lab. And then to message this far and wide. This mission is only going to be successful if, in five years, several thousand labs are using these standards.
TKF: Which brings me to the next question: Where do you hope the field will be in terms of data sharing in the next five years?
KOCH: Neurodata without borders!
SVOBODA: I hope that it will be routine that folks outside of experimental brain research—for example, machine-learning experts like those who Jeff mentioned earlier—gain fundamentally new insights by doing meta-analyses across multiple datasets generated perhaps half a decade apart in different laboratories. That would be a major step forward, a new kind of brain science.