“Every two days now, we create as much data as we did from the dawn of civilization until 2003.” – Eric Schmidt (Executive Chairman, Google) in 2011.
It was with this surprising quote that Dr. Anand Rajaraman chose to begin his lecture on “How Big Data Is Changing The World”, as he addressed an overflowing IC&SR auditorium at 5 pm on Tuesday, the 13th of August.
An alumnus (CS/BT/93) of the Institute who was awarded the Distinguished Alumnus award in 2013, Dr. Anand was presenting his second talk as part of the Alumni Leadership Lecture Series conducted by the Office of Alumni Relations; in fact, he recalled, it was his first talk that had inaugurated the Lecture Series.
In the 20 years since his graduation, Dr. Anand has founded several start-ups and worked as Director of Technology at Amazon; he describes himself now as an entrepreneur and a venture capitalist who works in the broad field of big data.
As Eric Schmidt’s quote conveys, the volume of data generated has skyrocketed over the years. All this data is harnessed by what are collectively known as data-driven applications, which Dr. Anand loosely defined as “applications that use data in some non-trivial way.” There are four generations of these, which he proceeded to elaborate on:
“The first-generation applications were found mainly in big companies in the late 80s and early 90s. These aimed to structure private data assets – inventory and employee records – effectively for competitive advantage. Some companies, however, went beyond this simple everyday automation,” he said, citing the example of association rule mining, in which purchase records are analysed to find items that are frequently bought together, a pattern that is then exploited to boost sales.
He mentioned a surprising finding that these applications made: beer and diapers are often bought together. Curious researchers later dug up the reason – when there are babies in the house, people tend to visit bars less frequently, and thus drink at home.
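For readers curious about what such an analysis involves, here is a minimal sketch in Python of the simplest form of association rule mining, restricted to pairs of items. It is an illustration only, not the method used by any company mentioned in the talk; the function name `pair_rules`, the `min_support` threshold and the toy shopping baskets are all assumptions made up for this example. "Support" is the fraction of baskets that contain both items, and "confidence" is the fraction of baskets containing the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return (a, b, support, confidence) for rules "a -> b" over item pairs.

    Hypothetical helper for illustration; real systems (e.g. Apriori-style
    miners) also handle larger item sets and much bigger data.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # confidence of a -> b is the share of a's baskets that contain b
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    # strongest rules first; names break ties so the order is deterministic
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Made-up baskets, purely for demonstration.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["diapers", "beer", "milk"],
    ["bread", "chips"],
]
rules = pair_rules(baskets, min_support=0.5)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

On this toy data only the beer and diapers pair clears the support threshold, which is the kind of pattern a retailer could then act on.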
The second generation of applications was about harnessing the power of public data. Prominent since the mid 90s, these applications are best exemplified by search engines like Google, which index and rank content that is freely available on the web.
The third generation leverages “semi-public” social and mobile data. These applications work with personal data, which has been shared publicly by user consent in sites like Facebook, Twitter and LinkedIn. The volume of such data available is enormous, and keeps growing – there are on average 500 million tweets and 3 billion Facebook shares every day. He provided the example of tweetbeat (which he had helped build), an application that analysed tweets from around the world and generated a dynamic newspaper based on trending topics.
The fourth, and most recent, generation aims to combine public and semi-public data with private data. “With this,” Dr. Anand said, “we’re combining oil (private data) and oxygen (public, freely available, data) to generate a veritable explosion of activity.” He gave several examples, including that of Shopycat, which aims to connect people and products through social media. It analyses people’s profiles, tries to figure out what products they like and suggests products that can be gifted to them.
He also talked about Cake Pops (a cross between a cake and a lollipop), which had once seen a sudden explosion of popularity, as measured by mentions on Twitter. This information was exploited by Walmart, which started “Make your Cake Pop” counters throughout the USA.
The phenomenon of big data, he went on, is changing the face of the industry. To succeed in a data-poor world, one had to apply complex machine learning algorithms to the sparsely available data to generate models. These models were then used to make relevant predictions.
In a data-rich world, however, these methods are not as effective. The more complex a model, the more brittle it is. The modern era, where data is abundant and far more dynamic, calls for new prediction methods that are “model light and data heavy”, which can adapt faster to a rapidly changing world. This, in a nutshell, is the big data approach. A good example is Google’s spelling correction engine, which is very effective despite using an algorithm that does not incorporate any understanding of the English language – it works by comparing the input to similar searches.
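The "model light and data heavy" idea can be sketched in a few lines of Python. The example below is a simplified illustration and not Google's actual algorithm: it knows nothing about English and simply suggests the most frequent past query that lies within a small edit distance of the input. The function names, the `max_distance` parameter and the tiny query log are assumptions invented for this sketch.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def suggest(query, query_log, max_distance=2):
    """Return the most popular logged query close to `query`.

    Hypothetical helper: falls back to the input itself when nothing
    in the log is within `max_distance` edits.
    """
    counts = Counter(query_log)
    candidates = [q for q in counts if edit_distance(query, q) <= max_distance]
    if not candidates:
        return query
    # prefer frequent queries; among equally frequent ones, prefer closer ones
    return max(candidates, key=lambda q: (counts[q], -edit_distance(query, q)))


# Made-up query log, purely for demonstration.
query_log = ["weather"] * 5 + ["whether"] * 2 + ["leather"]
print(suggest("wether", query_log))
```

All of the "knowledge" here lives in the log of past searches rather than in the code, so the suggestions change as soon as the data does, which is the adaptability the speaker was pointing to.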
The speaker wound up by noting that, historically, progress has always been driven by data collection. Astronomical observations gathered from the 16th century onwards were analysed by Isaac Newton in the 17th century to arrive at his famous laws of motion and gravitation. Thus, in some sense, it was this information about the skies, which was the only data that could be collected easily back then, that triggered modern physics. Today, however, data about people and society is available on an unprecedented scale. “This can only mean,” Dr. Anand concluded, “that we are now on the cusp of a revolution in the social sciences.”
T5E has also published an exclusive interview with Dr. Anand and his classmate and fellow Distinguished Alumnus 2013 awardee, Dr. Hari Balakrishnan.