Marketers, media companies, agencies, and researchers frequently come to us with questions about how to make sense of the enormous data streams at their disposal. While we're accustomed to doling out advice and pointing people to relevant studies on set-top-box data, scanner data, or clickstream data, we want to challenge the industry to think bigger when it comes to big data. So we went as big as we could—outer space. We caught up with Jon Jenkins, Senior Research Scientist, SETI Institute at NASA Ames Research Center, and asked him how he makes actionable insights out of galaxies of data. We hope his answers will give you some perspective on how to deal with your own data issues, even if they're merely terrestrial.

ARF: Can you briefly describe the research you do?

Jenkins: I develop science algorithms for processing photometric data from the Kepler mission to produce science archive products and to detect and characterize weak planetary signatures in the data. Our goal is to determine what fraction of stars in our galaxy host potentially habitable Earth-size planets. Kepler takes images of over 150,000 stars in order to detect the small dips in brightness that occur when a rocky planet like Earth passes between the space telescope and the planet's host star. We have to process the raw image data to calibrate it and extract brightness measurements for each star for each half-hour interval. The pipeline corrects the data for instrumental effects and then searches for these very weak signatures to identify stars that might have planets. The pipeline then conducts a series of diagnostic tests that either strengthen or undermine confidence in the signature's planetary nature. We furnish the list of stars with transit-like features, along with the diagnostics, to the science team for follow-up and analysis.
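For readers who want a concrete picture of the dip search described above, here is a minimal sketch in Python. It is not the Kepler pipeline's algorithm. It is a simple sliding-window search run on synthetic data, and every function name and parameter value (sample count, dip depth, noise level, threshold) is a hypothetical choice made for illustration.

```python
import random
import statistics


def make_light_curve(n_samples=2000, period=300, duration=6,
                     depth=0.001, noise=0.0002, seed=0):
    """Synthetic relative brightness: 1.0 plus Gaussian noise,
    with a box-shaped dip of `duration` samples every `period` samples.
    All values are illustrative assumptions, not Kepler figures."""
    rng = random.Random(seed)
    flux = []
    for i in range(n_samples):
        in_transit = (i % period) < duration
        dip = depth if in_transit else 0.0
        flux.append(1.0 - dip + rng.gauss(0.0, noise))
    return flux


def find_dips(flux, window=6, threshold=5.0):
    """Return start indices of windows whose mean brightness sits more than
    `threshold` standard errors below the median brightness."""
    baseline = statistics.median(flux)
    # Robust scatter estimate from the median absolute deviation (MAD).
    mad = statistics.median(abs(f - baseline) for f in flux)
    sigma = 1.4826 * mad
    # Standard error of the mean of `window` independent samples.
    sigma_window = sigma / window ** 0.5
    hits = []
    for start in range(len(flux) - window + 1):
        mean = sum(flux[start:start + window]) / window
        if (baseline - mean) / sigma_window > threshold:
            hits.append(start)
    return hits


def group_events(hits, max_gap=6):
    """Merge hit indices that lie within `max_gap` samples of each other
    and return the first index of each merged event."""
    events = []
    last = None
    for h in hits:
        if last is None or h - last > max_gap:
            events.append(h)
        last = h
    return events


if __name__ == "__main__":
    flux = make_light_curve()
    events = group_events(find_dips(flux))
    print("candidate dip events start near samples:", events)
```

Each individual dip in this toy example is far stronger relative to the noise than the signatures Jenkins describes. The sketch also leaves out calibration, correction for instrumental effects, and the diagnostic tests, which are the harder parts of the work.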

ARF: What are the biggest challenges you face in making sense of such large data sets?

Jenkins: The biggest challenge is to learn how various instrumental effects manifest themselves across the 95 million pixels in Kepler's camera and across the ~200,000 stars we've observed since the start of the mission, each of which has its own unique behavior to consider.

Kepler is NASA's first mission capable of finding Earth-size planets orbiting Sun-like stars. It's an order of magnitude better than any previous space photometer and about two orders of magnitude better than any ground-based photometer. Kepler is collecting data with unprecedented precision, duration, and contiguity. The photometer is exquisitely sensitive (otherwise it couldn't do its job), but it's also sensitive to its thermal state. We're finding that the biggest obstacle to detecting the very small signals we're looking for is a combination of instrumental effects and the fact that most of our target stars appear to vary more in brightness than our Sun does. Dealing with large data sets has forced us to design a set of processing clusters that can keep up with the data accumulation rate. However, we have moved the most computationally intensive parts of the pipeline, namely the identification and characterization of planetary signatures, to the Pleiades supercomputer at NASA Ames Research Center (which is also where the Kepler Science Operations Center is located).

We are learning how to deal with the instrumental effects as we gain experience with the data and the spacecraft, and it is frustrating that we can't instantaneously reprocess all the data as we upgrade the pipeline. I wish we could reprocess the data much faster than we can: with the new software we're just about to release, it will take us about 7 months to reprocess all the data we currently have and catch up with the new data. We are limited by the speed with which we can pull data from the filestore and by the number of processing cores available in our own clusters.

ARF: What kinds of general questions should researchers be asking when working with big data? Can you give any advice on how to conceptualize enormous datasets like those you work with?

Jenkins: I think it's very important to think beyond the nominal processing use case when you are designing an automated pipeline that must process significant amounts of data, especially if that data set accumulates over time and full reprocessing is a necessity. I'm thinking specifically about the use case of accessing the data to diagnose problems with the software or with the instrument. The scientists and programmers who are evolving the software need to be able to access and work with large fractions of the data in order to do a credible job. If you design a pipeline to run well in an unsupervised environment but don't think about how the people responsible for the quality of the data products need to interact with the data, and how they need to prototype and test new algorithms, you may unintentionally hamstring your ability to solve the unknown problems that inevitably arise when you are breaking new ground in science and technology.

Want to hear more? Jon Jenkins and other data experts will be speaking at our Industry Leader Forum on October 27. Registration is open.