Every week, we manage more than 23 exabytes of data. What challenges do researchers face with such an overwhelming information overload?
This is the fourth in a series of four articles by Dr. Jonathan Trinastic.
In 2002, human civilization duplicated or recorded 23 exabytes of data (that’s 23 followed by 18 zeros in bytes, or 23 billion gigabytes). That may sound like a lot, but fast-forward to 2017, and we are manipulating this much data in just one week. The amount of information at our fingertips is staggering, whether it concerns Stephen Curry’s clutch 3-point field goal percentage at home against winning teams or the three-dimensional structure of proteins relevant to cancer treatment. It can sometimes feel like a second job just to catch the local news, stay up to date on global geopolitics, review the last 17 emails from relatives and advertisers, and still have the brainpower to remember your kid’s soccer practice.
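For readers who want to check the arithmetic, here is a minimal sketch of the unit conversion. It assumes decimal (SI) prefixes, so one exabyte is 10^18 bytes and one gigabyte is 10^9 bytes, and it assumes the 2002 figure covers a full year. The function name is purely illustrative.

```python
# Decimal (SI) prefixes: 1 exabyte = 10**18 bytes, 1 gigabyte = 10**9 bytes.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_GIGABYTE = 10**9
WEEKS_PER_YEAR = 52


def exabytes_to_gigabytes(exabytes: int) -> int:
    """Convert exabytes to gigabytes using decimal (SI) prefixes."""
    return exabytes * BYTES_PER_EXABYTE // BYTES_PER_GIGABYTE


volume_eb = 23  # figure quoted in the article
volume_gb = exabytes_to_gigabytes(volume_eb)

# Assumption: the 2002 total spans one year and the 2017 total spans one week,
# so the same volume is handled in 1/52 of the time.
rate_increase = WEEKS_PER_YEAR

print(f"{volume_eb} EB = {volume_gb:,} GB")
print(f"Handling that volume weekly instead of yearly is roughly a {rate_increase}x faster rate")
```

Under those assumptions, handling in a week what once took a year works out to roughly a fifty-fold increase in the rate at which we handle data.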
Scientists are beginning to face similar problems with information overload. Public databases are popping up across research fields, from neuroscience to social science, to organize the large quantities of related data collected by researchers around the world. The Worldwide Protein Data Bank (wwPDB), for example, houses information about newly discovered protein structures. In 2016 alone, researchers loaded over 11,000 new structures into the database. In total, nearly 130,000 structures are available to download and study. Then there’s the World Data Center for Climate, estimated to hold over 300 terabytes of climate data and simulation results as well as 6 petabytes of data on magnetic tapes.
Information overload aside, the upside of these massive amounts of data could be life-altering, literally. Public data repositories allow for collaboration between scientists around the world, even if they’ve never met, since their data is all stored in the same location. Pieces of data that would never have been brought together could now become pieces of a larger puzzle—leading to breakthroughs in medicine, climate predictions, energy technology, and more. Perhaps most importantly, public access to data will allow for reproducibility of scientific results to ensure the integrity of the scientific process.
Unfortunately, we have a long way to go to get to these starry-eyed promises of big data. Poorly organized or poorly described data uploaded by one researcher could be unintelligible and useless to another. Inaccurate data can quickly propagate errors in research results once everyone is using the same database (known colloquially as “garbage in, garbage out”).
What role does the federal government play in the world of big data? First, there is a tremendous opportunity to develop new standards for evidence-based decision making on policies and regulations. With more data, federal offices can quantitatively assess the impact of different programs and whether they indeed match agency goals. This is a difficult task because it requires significant organization from the very beginning of a program to plan the collection, management, and analysis of large data sets. Work is underway, however: Congress created the Commission on Evidence-Based Policymaking last year to develop the expertise and infrastructure within the executive branch for this type of work. The new administration has an important choice to make on whether to continue this trend toward using data and quantitative evidence to support executive policy decisions.
Second, the federal government is a central hub for databases of both government data and information collected from federally funded research activities. Many examples can be found on data.gov; topics range from hourly electricity data from across the country to information about consumer complaints. The federal government can play a leading role in applying robust guidelines for database organization and integrity, such as the FAIR principles already proposed by researchers, which call for data to be findable, accessible, interoperable, and reusable.
Information overload has become a common experience. Organizing these databases and ensuring the accuracy of the data inside them may be the most difficult challenge of the 21st century, because success or failure will affect every other issue we face and how we go about solving it. Databases are now prevalent in fields including neuroscience, materials science, and climatology, fields that will shape our understanding of brain disease, energy technology, and climate change. But with all this data swarming into public repositories from myriad global sources, questions inevitably arise. Which datasets can we believe? What sources are most reliable? How much data is enough to efficiently solve a problem? As the use of data becomes an integral part of policymaking, federal officials must decide the government’s role in answering these questions with confidence and consistency if the age of big data is to live up to its hype.
—Dr. Jonathan Trinastic earned his PhD in physics at the University of Florida. He is interested in renewable energy technology and sustainable energy policies, as well as living by Ernst Schumacher’s mantra that “small is beautiful.” Read more of Jonathan’s work at his personal blog, Goodnight Earth, and follow him on Twitter @jptrinastic. All views expressed are solely his own and do not reflect those of his employer.