The most obvious is to believe that the big data revolution is a matter of sheer mass, perhaps the most salient aspect of the data revolution, arising from our interactions with interconnected devices such as cell phones, credit cards, or social networks. Massive data which, paired with powerful algorithms, seems to free science and daily practice (public or private) from the old limitations of traditional statistics. By way of example, poverty in Greater Buenos Aires is measured with a survey of approximately 3,000 households (the Permanent Household Survey, carried out periodically by INDEC), and an electoral poll, of the kind that will mushroom in the months to come, is based on no more than 1,000 observations. These figures sound ridiculous compared with the billions of data points spat out daily by interactions on social networks.
A promise of classical statistics is that "more is better": if a statistic is correctly designed, a sample with more data should trivially be better than one with less, in the sense that any electoral poll is ultimately a partial version of the election itself, something like the mother of all political polls, since by definition it includes every voter. From this perspective, big data should be the best news for data users, an authentic injection of life for disciplines previously forced to make cyclopean efforts to extract as much information as possible from a few observations.
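A minimal simulation can illustrate this promise; the numbers below (a hypothetical electorate in which 52% support one candidate) are invented purely for illustration. In a well-designed random poll, the estimation error shrinks as the sample grows, roughly at the rate of one over the square root of the sample size:

```python
import random

random.seed(1)

p_true = 0.52  # hypothetical share of the electorate supporting candidate A

def poll(n):
    """A well-designed poll: n voters drawn at random from the electorate."""
    hits = sum(random.random() < p_true for _ in range(n))
    return hits / n

errors = {}
for n in (1_000, 10_000, 100_000):
    estimate = poll(n)
    errors[n] = abs(estimate - p_true)
    print(f"n = {n:>7}: estimate = {estimate:.3f}, error = {errors[n]:.3f}")
```

Running the sketch, the larger polls typically land closer to the true 52%, which is precisely the sense in which, under correct design, more is better.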
But comparing big data with data from traditional sources (surveys, administrative records, or laboratory experiments) is comparing apples with pears. Big data is an ocean of anarchic, unstructured, spontaneous data, generally generated not for the mere purpose of obtaining data but with some other objective. For example, anyone who has used Waze or Google Maps to go from home to work generated data, thousands of data points, but not with the purpose of creating them: the purpose was choosing the best route.
By contrast, the data from a survey or an experiment come from painstaking protocols that try to guarantee that a few data points can reliably represent a much larger population. Moreover, the whole purpose of running a survey is to collect data, and those who respond do so consciously: there is nothing spontaneous about answering a traditional survey, let alone about the complications involved in carrying out a laboratory experiment.
So, more is better if the survey or experiment is correctly designed to honestly reflect reality. It is in this framework that, when a survey is well designed, more data is obviously better than less, and that big data would then be great news.
But by its very nature (complex, unsystematic), big data is not a properly designed survey but a flood of spontaneous data, and so more is not necessarily better. Worse, a few very well-designed data points (from a survey or an experiment) may contain much more information, and lead to more reliable conclusions, than a sea of anarchic, when not outright biased, data. As an example, at the dawn of data analysis, back in 1799, the great Carl Friedrich Gauss obtained important measurements of the shape of the Earth with only... four data points! Just four observations, but meticulously verified and studied in the light of a precise astrophysical theory. Likewise, the thousands of data points that can be collected from the users of a highway (with electronic sensors, digital cameras, etc.) say little, if anything, about those who do not use it, and perhaps those non-users are precisely the relevant population to probe when making policy decisions, such as investing in an additional lane.
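The highway example can be made concrete with a short simulation; all the numbers (the population size, the 30% share of highway users, the willingness-to-pay figures) are hypothetical, chosen only to show how a large biased sample can lose to a small well-designed one:

```python
import random

random.seed(0)

# Hypothetical population of 1,000,000 commuters; 30% use the highway.
# Highway users value an extra lane much more than non-users do.
population = ([random.gauss(8, 2) for _ in range(300_000)]    # users
              + [random.gauss(2, 2) for _ in range(700_000)]) # non-users
true_mean = sum(population) / len(population)  # roughly 0.3*8 + 0.7*2 = 3.8

# "Big data": every reading from highway sensors, i.e. users only.
big_but_biased = population[:300_000]
big_estimate = sum(big_but_biased) / len(big_but_biased)  # near 8, far from truth

# Small but well designed: 500 commuters drawn at random from everyone.
small_sample = random.sample(population, 500)
small_estimate = sum(small_sample) / len(small_sample)  # should land near true_mean

print(f"true mean:            {true_mean:.2f}")
print(f"300,000 biased obs.:  {big_estimate:.2f}")
print(f"500 random obs.:      {small_estimate:.2f}")
```

With these made-up numbers, the sensor data, despite being 600 times larger, systematically overestimate the average value of the lane for the whole population, while the modest random sample lands near the truth; no amount of extra sensor readings fixes the bias.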
So, are we worse off with the big data revolution? No, nothing could be further from the truth. Big data is great news, but, as with many things, not because of size: the ocean of spontaneous data can provide crucial information about aspects of society that are often unattainable through traditional mechanisms, such as surveys or marketing focus groups.