Monday, June 3, 2013

Back to Thinking of Big Data


So, back to the premise that there is more data than we can sift through to figure out what exactly causes El Niño. The video above gives a good overview of what "Big Data" is and the problems inherent in it. Basically, the typical problem with big data is that relevant information is usually destroyed before it becomes of any use. The difficulty of deriving meaning from data means it is disposed of to make room for newer data, which will in turn be disposed of for the same lack of processing capacity.

A few weeks ago, I created a list of data sets that might be attainable for researching El Niño's causes. Even if I were to obtain all of that information, there is a very real risk of the relevant signal slipping right through whatever fish net (that is, whatever analysis) we use to sort it.

So, the temptation with Big Data is to store the analysis and dump the data. For example, say we are using a storage unit. Every time we fill the storage unit, we create an inventory list of everything in it. However, when we empty the storage unit, we destroy that inventory list; all we keep is a record that the unit was filled and did contain something. Then the storage unit is refilled and a new inventory list is created. We have no way of knowing whether what was in the storage unit the first time was in any way related to what was in it the second time. We no longer have that first inventory list.
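As a rough sketch of that summarize-then-discard pattern (the function and numbers here are purely hypothetical, not any real pipeline), it looks something like this:

```python
# Toy illustration of "store the analysis, dump the data": each batch is
# reduced to a summary ("the unit was filled and held N items"), and the raw
# records (the inventory list) are thrown away to make room for the next batch.
summaries = []

def process_batch(records):
    """Keep only aggregate facts about a batch, then discard the batch itself."""
    summaries.append({
        "count": len(records),
        "mean": sum(records) / len(records),
    })
    records.clear()  # the "inventory list" is gone for good

process_batch([21.4, 22.1, 20.9])  # first filling of the storage unit
process_batch([18.7, 19.2])        # second filling

# We can say each batch existed and how big it was, but we can no longer ask
# whether the items in the first batch related to those in the second.
print(summaries)
```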

In a way this makes a lot of sense, because there seems to be a terminal point at which data is of use. For instance, in a marketing class I took, we were using SPSS to analyze a data set with a large number of variables. There were so many variables that our measure of model fit (R^2) just kept creeping upward as we added them, even though the model was not actually predicting the outcome any better. There was simply no use for the twenty variables we had access to, because we could build a better model with three of them.
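A small sketch of that effect (simulated numbers, not the class's SPSS output): plain R^2 can only rise as predictors are added, even pure-noise ones, which is why it kept "improving" while the model did not. An adjusted measure penalizes the useless variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
signal = rng.normal(size=(n, 3))                      # three genuinely useful predictors
y = signal @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)
noise = rng.normal(size=(n, 17))                      # seventeen irrelevant predictors
X_all = np.column_stack([np.ones(n), signal, noise])  # intercept + 20 candidate predictors

for p in (3, 10, 20):
    X = X_all[:, : p + 1]                             # intercept + first p predictors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()                    # plain R^2: never falls as p grows
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # adjusted R^2: penalizes useless predictors
    print(f"{p:2d} predictors: R^2 = {r2:.3f}  adjusted R^2 = {adj:.3f}")
```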

However, this is similar to taking a derivative in calculus: yes, we get new, useful information, but we can no longer see the big picture. If we keep analyzing the analysis, we go from a complex series of curves to a straight line.

[Figure: WyzAnt Tutoring graphic showing an example of the position, velocity, and acceleration of a particle as a complex curve, a hyperbolic curve, and a straight line.]
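To make the derivative analogy concrete, here is a small worked example (my own numbers, not taken from the graphic above): each derivative gives new, useful information, but the curve gets simpler at every step.

```latex
% Position, velocity, and acceleration of a particle: each derivative is new,
% useful information, but the curve gets simpler and the detail is lost.
\begin{align*}
  s(t) &= t^3 - 6t^2 + 9t        && \text{position: a complex curve} \\
  v(t) = s'(t) &= 3t^2 - 12t + 9 && \text{velocity: a parabola} \\
  a(t) = v'(t) &= 6t - 12        && \text{acceleration: a straight line} \\
  a'(t) &= 6                     && \text{constant: the big picture is gone}
\end{align*}
```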
The equivalent in the El Niño case would be analyzing the results of many analyses together, perhaps not unlike taking a derivative of a derivative in calculus. For example, we could analyze how an ocean temperature study compares to an air temperature study, and then compare the results of that comparison to the lunar orbit. Does the underlying data still matter? I suppose that remains an unknown for the time being. My guess would be "yes."
