I'm winding down after a productive sabbatical leave and of course summer's waning days. I should be working on that new course preparation (yikes, I'm teaching in a week!), but I wanted to say something about trends in recent months (and years) connected to Big Data in the media and science.
It strikes me that the original instantiation of big data is meta-analysis. Isn't it the goal of meta-analysis to pool all the relevant data into one analysis? Yes. So every meta-analysis could literally claim to be the biggest qualifying dataset on its subject. I've been doing meta-analyses for nearly 30 years, and one thing that always attracted me was the increased statistical power I could wield because I was capitalizing on the law of large numbers: As available observations increase, statistical trends stabilize and inferences become surer. An approach embracing big data should in theory bring one closer to the truth, other things being equal. (Of course, they are not necessarily equal: Data can vary in quality, etc.)
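To make the point about statistical power concrete, here is a minimal sketch of inverse-variance (fixed-effect) pooling, one common way of combining study estimates. The effect sizes and standard errors are made-up numbers for illustration, and the function name `pool_fixed_effect` is my own; a real review would often use a random-effects model and dedicated software instead.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance (fixed-effect) pooling of independent study estimates.

    Returns the pooled effect and its standard error.
    """
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    # The pooled variance is the reciprocal of the summed weights.
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Made-up illustrative numbers, not taken from any real meta-analysis.
effects = [0.30, 0.10, 0.25, 0.20]
std_errors = [0.20, 0.15, 0.25, 0.10]

pooled, pooled_se = pool_fixed_effect(effects, std_errors)
print(f"pooled effect = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
print(f"smallest single-study SE = {min(std_errors):.3f}")
```

Under this model the pooled standard error is always smaller than the smallest single-study standard error, which is the sense in which pooling makes inferences surer. It says nothing about data quality, of course.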
In contrast, the term big data is usually reserved for massive pools of internet-gathered web hits and the like, which are usually not merged from independent studies. Gapminder is essentially a big data approach, tracking nation-level data. Those data in turn are gathered from quasi-independent surveys that other bodies such as the UN or the World Bank collect. (One could argue that databases spanning many economic and social surveys are essentially meta-analyses whenever temporal trends are examined, though they are rarely if ever described that way.)
Google is essentially a mammoth (viz. BIG) database, optimized for your text searches. The same goes for Google Scholar. These can be used as a quick means of gauging temporal trends. From the figure below, you can see that meta-analysis has grown ever more popular over the last 21 years. Note that the 2013 figure is left unadjusted (only about 8 months have passed so far this year).
Caveats and queries:
- Methods note: To produce that figure, on 18 August 2013, I searched Google Scholar for 'meta-analysis' in the titles of reports. The trends presented are the percentages of reports of the total available in each year, where the total is determined by searching for the letter 'a' (among the most frequently occurring characters) in title or abstract.
- That method of searching titles suggests a cumulative number of meta-analyses [in title] somewhere around 55,000; if you directly search Google Scholar without restricting year-by-year, it comes up with a much larger number: 115,000. Can anyone explain the difference? (It's hard to imagine that there are 60,000 titles without dates!) Here are the temporal trends in number of meta-analyses and all literature:
- In the preceding graph, the total literature (annual publications) drops dramatically after 2003. I suspect that is an anomaly associated with Google Scholar and the long time that it takes some reports to reach Google's spiders or get posted on the internet. Does anyone have other explanations? The graph should give you pause about assuming that Google Scholar is always up to date on every subject. As time passes, you can test my hypothesis, because re-searching Google Scholar should yield higher sums than my totals, especially for the negatively sloped portion of the dashed line.
- In turn, the suggestion is that the dramatically rising (solid) line for meta-analyses in the literature is also an underestimate and the trend is even more dramatic.
- It seems everyone is doing meta-analysis! Welcome to the party!
- Caveat: Note also that if I used synonyms for "meta-analysis" or extended the search to abstracts, there would be dramatically more meta-analyses. So what we see in the first graph above could merely reflect a growing convention to put "meta-analysis" in titles, not an actual increase in the frequency of meta-analytic reviews in the scientific literature.
- NB. The grammatically inclined reader will realize that, in this blog entry, I have not once used a singular verb (e.g., "is") with the subject "big data." Call me old-fashioned and contrarian, but the word "data" is plural! My workaround is to avoid using "big data" as a subject in any of the sentences; example: "it's a big-data approach." It's kind of fun to think about a big datum (a single piece of very large information!). Is a "big datum" even possible? The concept of "big data" is such an amazing contradiction when used with a singular verb, ironically compressing all of the teeming observations and variables into a single mass. Even if you are not grammatically inclined, you can sidestep the problem by using subjects like datasets, databases, collections of data, etc. May your verbs always agree with their subjects!
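
Returning to the methods note above: the per-year percentage is just the number of title hits divided by the total hits for that year. Here is a minimal sketch of that arithmetic. The counts are placeholders I invented for illustration, not my actual Google Scholar results, and the function names `title_share_by_year` and `annualize` are hypothetical. The annualizing step is an optional extra; recall that I left the 2013 figure unadjusted.

```python
def title_share_by_year(meta_hits, total_hits):
    """Percentage of each year's reports whose titles contain the search term.

    Both arguments map a year to a hit count; years with zero total are skipped.
    """
    shares = {}
    for year, total in sorted(total_hits.items()):
        if total > 0:
            shares[year] = 100.0 * meta_hits.get(year, 0) / total
    return shares


def annualize(count, months_elapsed):
    """Scale a partial-year count to 12 months, assuming a constant monthly rate."""
    return count * 12.0 / months_elapsed


# Placeholder counts for illustration only (NOT real Google Scholar results).
meta_hits = {2011: 50, 2012: 60, 2013: 40}
total_hits = {2011: 100_000, 2012: 100_000, 2013: 60_000}

for year, pct in title_share_by_year(meta_hits, total_hits).items():
    print(f"{year}: {pct:.3f}% of reports")

# A raw count for a year that is only about 8 months old could be scaled up.
print(f"2013 title hits, annualized: {annualize(meta_hits[2013], 8):.0f}")
```

If the numerator and denominator are both partial-year counts, the percentage itself needs no such adjustment; only the raw counts do. And the constant-rate assumption ignores the indexing lag I suspect above.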