So. I am deeply humbled to be here tonight, and I'd like to, if I do well, which is a big if, dedicate this lecture tonight to my high school teacher. And here is why.

Many years ago, more than three decades ago, when I was in high school, I was selected to prepare for and then to compete in the Mathematics Olympics, because people in Austria thought that I was smart enough to be a good contestant. And to my embarrassment, I didn't do very well. So my high school teacher, Hans Kraus, came to me and he said: you're just not smart enough to be a really good mathematician, but maybe you can be an applied mathematician. So how about the Physics Olympics?

So I did, and I did very well. And with that, I shall start and dedicate this lecture to Hans Kraus, who gave me the guts to speak to an audience of mathematicians today.

I'm starting to talk about big data with a case study, a particular case that has to do with something that you're all familiar with, and that is the flu. Every year, tens of thousands of people around the world die of the seasonal flu. But in 2009, a new flu virus was detected, the H1N1 virus. And at that stage, there was no vaccine available. So the best that public health authorities around the world could do was to try to limit the spread of the virus. But for that, they needed to know where the virus was.

In the United States, this role is taken on by the Centres for Disease Control in Atlanta, and they have doctors, general practitioners from around the country, tell them about each and every H1N1 case that they see. And based on that, and some hard labour over days and days and days, they are able to tell the policymakers where the flu is at any given point in time, two weeks later. Which is an eternity if you have a pandemic underway.

Just around the same time, engineers at a little start-up company called Google in Mountain View had a very different idea of how to predict the spread of the flu. They said: we'd like to predict the spread of the flu by just using Google searches, that is, search requests sent to Google. Google receives about 5 billion search requests every single day, and has stored and saved all of them, including information about where they came from and so forth, over the last 15 years. So the idea was to take the last five years or so of Google search requests that it received, and to take the official figures from the Centres for Disease Control, and to see whether there would be a correlation, some kind of a connection.
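To make the idea concrete, here is a minimal sketch in Python of the kind of correlation screening described above: line up the weekly frequency of each candidate search term with the official CDC figures and keep the terms that track flu activity most closely. The file names, column names, and the simple least-squares "nowcast" at the end are hypothetical illustrations, not Google's actual pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: the weekly fraction of queries for each candidate search term,
# and the CDC's weekly influenza-like-illness (ILI) rate.
searches = pd.read_csv("weekly_search_fractions.csv", index_col="week")
cdc = pd.read_csv("cdc_ili_rates.csv", index_col="week")["ili_rate"]

# Compare only the weeks that both sources cover.
searches, cdc = searches.align(cdc, join="inner", axis=0)

# Screen the candidates: Pearson correlation of each term with the CDC series.
correlations = searches.corrwith(cdc).sort_values(ascending=False)
print(correlations.head(10))   # the terms that best track official flu activity

# A toy nowcast with the single best term, via ordinary least squares:
# available immediately, rather than two weeks later.
best = searches[correlations.index[0]]
slope, intercept = np.polyfit(best, cdc, deg=1)
estimated_ili = slope * best + intercept
```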
And in that process, Google's engineers tested 50 million different search terms across 150 million mathematical models. And then they found one that provided a pretty darn good prediction.

Here you see the official CDC data and the Google Flu Trends predictions. But importantly, where the CDC prediction was always about 10 to 14 days late, Google could do it almost in real time. And the idea behind this, the direction that this points towards, is big data.

Now, big data is not just about public health. We find it everywhere, including in finance, in financial services. The vice chancellor has already alluded, I think, in his speech, to the importance of this matter, especially in light of the Great Recession of 2008. When we look at September of 2008, and when we now read the protocols of the Open Market Committee of the Federal Reserve, we find that they were clueless about what was going on, because the data was not there. In fact, the deflationary shock after the Lehman Brothers default, after the Lehman Brothers bankruptcy, was only felt in the CPI, the Consumer Price Index, many weeks later, because it takes many weeks to compile the Consumer Price Index.

There's a small start-up company that came out of a research project at the Massachusetts Institute of Technology; the start-up is now called PriceStats. And what they do is they go out on the Internet and, every single day, every single hour, they suck down over a billion price points of consumer goods from hundreds of thousands of different offerings. And they do this in order to get very early indications of inflationary or deflationary trends. And so they were able to see the deflationary impact of Lehman Brothers and what came afterwards much earlier. If, in fact, policymakers had known, if, in fact, Federal Reserve policymakers had known, perhaps some of the policy decisions would have been made differently.

This is big data: the idea that we can gain insights from a large number of data points that we couldn't gain otherwise.

Now, for all of human history, we human beings have tried to make sense of the world around us. And we did that mostly by observing it. And in order to do that, we needed to collect data. But for all of human history, until extremely recently, the collection of data was hard, difficult, time consuming, expensive. And so in this world of small data, we collected as little data as absolutely necessary in order to answer the question that we had. We defined, designed, created the processes, the institutions, the structures that we used in order to make sense of the data, knowing that we had very little of it, and that we needed to squeeze every last drop of meaning out of it.
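The PriceStats idea above also lends itself to a small worked sketch. Assuming a hypothetical scraper output of (day, product, price) rows, a crude daily inflation signal is simply the average day-on-day log price change across the scraped basket; this illustrates the concept only, not PriceStats' actual methodology.

```python
import math
from collections import defaultdict

# Hypothetical (day, product_id, price) rows as they might come out of a web scraper.
scraped = [
    ("2008-09-14", "tv-42in", 699.0), ("2008-09-15", "tv-42in", 689.0),
    ("2008-09-14", "cereal-1kg", 3.49), ("2008-09-15", "cereal-1kg", 3.39),
]

by_product = defaultdict(dict)
for day, product, price in scraped:
    by_product[product][day] = price

def daily_inflation(prev_day: str, day: str) -> float:
    """Average log price change across all products observed on both days."""
    changes = [math.log(prices[day] / prices[prev_day])
               for prices in by_product.values()
               if prev_day in prices and day in prices]
    return sum(changes) / len(changes)

# Negative values indicate deflationary pressure, available daily rather than weeks later.
print(f"{daily_inflation('2008-09-14', '2008-09-15'):+.4%}")
```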
What if that changes? What if we could envision a world in which there is a lot of data available? Then we would have to rethink, perhaps, the processes and the institutions and the structures that we use to make sense of the world, much like this gentleman who is stepping out of the old world view and into a new one.

It all began in earnest perhaps some 15, 20 years ago, in the natural sciences. Consider astronomy. When the Sloan Digital Sky Survey came online in the year 2000, it collected more data in its first few weeks of operation than had been amassed in the entire history of astronomy. Since then, since the year 2000, it has amassed over 200 terabytes of astronomy data. But a successor telescope, you see a rendering here, that is going to go on stream in the year 2016, is going to collect that much data every five days.

Or take genetics. In April of 2003, the world celebrated a tremendous achievement. That is, after ten years, $1 billion and a global effort, we, the world, were able to have a full sequence of one person's DNA: 3 billion base pairs. Fast forward to today, and sequencing my DNA, not just one person's DNA, my DNA, costs less than $1,000 and takes two to three days in one lab. And that generates another 3 billion base pairs of data.

Internet companies, too, are drowning in data. 500 million tweets go through Twitter every single day. 800 million YouTube users upload an hour of video every single second; even if you stopped sleeping, you could never watch that amount of video. And on Facebook, 10 million photos are uploaded every single hour.

Google processes dozens of petabytes of data every single day. Dozens of petabytes. Petabyte, petabyte, petabyte. What did you have for lunch? A couple of petabytes. How much is a freaking petabyte of data? If you take all of the characters in a book, and all of the books and magazines and all of the other holdings of the largest library in the world, the Library of Congress, together, and then multiply it by 100, that's about a petabyte.

If we look at the growth of data in the world, the best guesstimate that we have over time tells us that the amount of data in the world from 1987 to 2007 grew 100-fold. Now, that's quite amazing, isn't it? A 100-times increase. If, as Elizabeth Eisenstein maintains, we go back in human history to find a time when data increased as much, we have to go back to 1450 to roughly 1506. In these 56 years of the Gutenberg revolution,
the amount of data in the world doubled. Here, in 20 years, we have 100x.

But that's only half the story, and it's quite a powerful one already. The other half of the story is denoted by the different colours. The light pink denotes analogue data; the dark purple denotes digital data. And if you look at this white vertical line here, that is the year 2000. In the year 2000, three quarters of the data in the world was still analogue. Today, it's less than 1%. Within 15 years, we have moved from an analogue world to a digital world. And that, of course, means that it is easier to collect, easier to store, easier to analyse, easier to retrieve.

What does this do? How can we imagine what the results of this are, the consequences? Well, think about it. The real element, the essence here that I want to convey, is that if you increase something radically in quantity, it can take on a new quality.

Consider photography. If I take a photo of a rider on a horse, that's a photo of a rider on a horse. If I take a photo every second of a rider on a horse, that's a lot of photos of a rider on a horse. But what if I take 16 photos per second of a rider on a horse and show them in fast succession? Then this added quantity of data, the quantity of images that I have, translates into a new quality: into moving pictures. In essence, what big data provides us with is a new perspective on reality.

Now, how can we characterise that perspective on reality? Let me try with three words here: more, messy, and correlations.

First, more. More means that we have more data available today relative to the problem we are studying or the phenomenon that we are trying to investigate. It's not necessary to have a billion data points. You can have 60,000 data points and still be doing big data, if that encapsulates almost all, or quite close to all, of the phenomenon that you are trying to study. If you have that amount of data, that comprehensive dataset, available, then you can let the data speak, rather than have it answer questions that you already had in mind when you collected the data.

What do I mean by that? Let me use photography again. This is the moment here that I like the most. It is the moment when I take a picture of you. Now, would you please smile? Okay. Now, as I take this picture, I have to make a decision. The decision is: who do I focus on? Do I focus on the dapper Bill Nie in the first row? Or do I focus on you back there in the last row?
If I focus on you, Bill, then unfortunately you, back there, you'll be out of focus. You'll be blurred. And afterwards, I can't bring you back into focus. It's gone. The data isn't there. So I need to know, at the moment of collecting, what really is important for me and what isn't. But what if I don't? Then I have to start over again. Good luck. What if we could have something that would be better than that?

Well, take a photo again as an example. This is a photo of a toothbrush. It's in focus. Back in the blurry part of the photo you find, actually, a photo of my four-year-old son. I can't put him back into focus, right? He's blurred. Too bad.

But this is not a normal photo. This is a photo taken with a big data camera, a Lytro light field camera. Here it is. And so when I take a photo, it's a huge file. It takes all of the focal planes, it takes all of the light rays in. And I can click on my son, and there he goes and comes into focus, because all of the data is present in the photo. Or I can click, of course, on the toothbrush, if I'm more interested in that, and that comes into focus. I therefore can let the data speak, and ask it questions that I didn't know I wanted to ask when I collected it.

The second element is messy. And let me just very briefly say that in the big data age we will combine datasets of varying quality. And that requires us to give up a little bit on our desire, coming from the small data age, to focus on exactitude of data and data quality. What we gain in volume, we can trade off in quality, at least to a certain extent.

More and messy together lead to insights through correlations. Now, in the first class in statistics, what they tell you is that data does not give you causality, only correlations. That's true. But it's incredibly hard for human beings to understand.

So take the example of the global supermarket chain Walmart. Walmart captures all kinds of transactional data about what people buy and when and so forth. So they did a big data analysis to find out more. And they discovered that just before a hurricane hits a Walmart location, people go to the Walmart and they buy batteries and flashlights. Of course, I would have thought so. But then they discovered that they also buy Pop-Tarts. Actually, strawberry Pop-Tarts. Pop-Tarts are a sugary American snack, please note that I do not call it food, that is being sold at Walmart.

When they found this out through correlational analysis, these researchers immediately said: oh my God, why is this the case?
Why is this the case? And they came up with all kinds of hypotheses, right? You buy the Pop-Tarts to feed your kids; you buy the Pop-Tarts because if you eat a lot of Pop-Tarts, you basically hallucinate the hurricane away; or whatever. Until a researcher said: stop, time out. We don't know. The data doesn't tell us. And guess what? We don't need to know. All that we need to know is what is happening, not why. And that is good enough in this particular instance. And so the others said: that's right. So since then, Walmart, before a hurricane, moves the Pop-Tarts from the back of the shop to the front, and sells even more of them.

But for us human beings, that's really hard to understand, because we human beings are almost hard-coded to see the world as a sequence of cause and effect. We can't escape that. Daniel Kahneman, the Nobel laureate in economics, said that this is fast thinking: with it we think we make sense of the world, even though oftentimes we don't. It makes us feel comfortable. It gives us the feeling that we understand the world, even though we don't.

And so, for example, if I had had dinner with you in hall, in one of the colleges that shall remain unnamed here, just a hypothetical, last night, and I had a stomach bug this morning, I would immediately have connected what I ate in hall with the stomach bug, even though it is far more likely that I would have gotten my gastroenteritis bug by shaking hands with some of you. Our brain cannot stop creating these causal linkages. But when we do so, we must be incredibly careful, because oftentimes we are just following the wrong path. So rather than jumping to a quick conclusion about why things are the way they are, it would be better to just first know what the things are. Or, in other words, we need to learn to walk before we can run.

Now, one way of looking at this is through the example of machine translation. In the 1950s, the US government had amassed a lot of documents in Russian, but they didn't have enough translators. So they said: we'll translate them with the help of computers. And in went the computer scientists, and the computer scientists said: oh, this is easy. We teach the 200 or so grammatical rules to the computer, add a dictionary to it, and in three months we are done and we have machine translation. Twelve years later, and about $1 billion later, that project was declared a failure. Nothing happened in machine translation until the 1980s, when engineers at IBM had a very different idea.
They said: why don't we give up on teaching the computer why one word in one language translates into one word in another language, and just go with statistical correlations, with which word is most likely going to be translated from one language into another? And they had a great training text for that: the proceedings of the Canadian Parliament, available in English and French. And they did it, and it worked beautifully. It was the first machine translation that actually was useful. Then they thought that they could make it even better by changing, by tweaking, the algorithm. But it didn't really matter very much, so they gave up.

Ten years later, this start-up company in Mountain View that I already mentioned, Google, got into the fray. A German Google engineer by the name of Franz Och said: the problem is not the algorithm. The problem is the amount of data. We just need more data. And so they sucked in the entire World Wide Web: all the different language versions of the websites of the European Union, finally they're good for something; all of the multilingual websites of the multinational corporations; all of the PDF user manuals that can be downloaded, from your router to your ironing board to your VCR. And when I read this, and when I heard this, and when friends told it to me, Kenn and myself, we said: you must be kidding. I can't even read the English version of the manual of my VCR, because it was written in Shenzhen or somewhere. And he said: it doesn't matter. If you have so much of it, that little blip in the quality really doesn't matter. And so machine translation in Google Translate is a wonderful example of more, messy, and correlations.

But if you think now that this is all just about the Internet and Internet companies, think again. We heard about health just a little earlier, and health is an area where big data will make tremendous inroads. Think, for example, of a particularly vulnerable group of human beings: premature babies, like this one. Dr Carolyn McGregor, at the university hospital in Toronto, had the idea to use big data to help premature babies. Premature babies are particularly vulnerable because we discover that they have an infection often too late; symptoms manifest themselves too late. So what did they do? They had digital sensors that measured the vital signs of these premature babies, to the tune of 1,200 data points a second. And they collected them over hours and days and weeks, and over dozens of babies, and then looked for patterns that with a high degree of likelihood would predict the onset of an infection later on.
And they found the pattern. And so they now can predict the likely onset of an infection 24 hours before the symptoms manifest themselves, just by looking at the patterns. And guess what? There are two kickers in the story. First, what is the pattern? The pattern is that suddenly the vital signs stabilise, rather than going haywire. Who would have thought of that? You know, every doctor that I ask tells me: if the vital signs stabilise, it's time to go home, the patient is doing well. It's exactly the other way around with premature babies. And the other kicker is that Dr Carolyn McGregor, who is saving babies there, her doctorate is not in medicine. She's a computer scientist.

So what this tells us is that we need to be humble vis-a-vis the world around us, because we understand less than we think we understand. And we need to take this on board, and we need to see whether we can let the data speak, in order to understand the world with more of its complexity.

Now go back to the Google Flu Trends prediction. You may have heard in the media that Google Flu Trends was off, was wrong, in December of 2012. Here, on the very right-hand side, you see the spike: Google Flu Trends predicted more flu cases than there actually were. What was going on? What was going on was actually something that is crucial, and it brings in two Europeans, Mr Boole and Mr Bayes.

Because if you ask somebody what the chance is, when you throw a coin, that it lands heads up, most people will tell you 50%. That's a really good approximation, but it is actually wrong. Because if you throw a particular coin, it's not 50-50. It never is. Everybody throws slightly differently; every coin is slightly different. We just have an approximation: 50-50 works pretty okay. Like Newton's law of gravitation was pretty okay, until we needed to have GPS, and then we needed to go beyond it. But it's the same here. And so, whenever we have more data available and new data available, we need to feed that in and rethink our view of the world, our perspective of the world. That's why the Bayesian idea of using priors is so fundamental to what big data is all about. And the mistake that Google made is that they developed a model for Flu Trends once, in 2009, and stuck to it, never updating it based on the additional data that they had. When they did update it, with 2010 and 2011 data, their December 2012 forecast was almost spot on. So we need to abandon our idealised worlds and approximate reality through big data.
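The coin example and the Flu Trends fix come down to the same mechanic: treat the current model as a prior and fold every new batch of observations back into it, instead of fixing the model once and leaving it alone. A minimal sketch in Python of Bayesian updating under a Beta prior for the coin; the class and the numbers are illustrative only, not anyone's production model.

```python
from dataclasses import dataclass

@dataclass
class CoinBelief:
    heads: float = 1.0   # pseudo-counts of a Beta prior; (1, 1) encodes "roughly 50-50"
    tails: float = 1.0

    def update(self, new_heads: int, new_tails: int) -> None:
        """Fold a new batch of throws into the belief (today's posterior is tomorrow's prior)."""
        self.heads += new_heads
        self.tails += new_tails

    @property
    def prob_heads(self) -> float:
        """Current best estimate of P(heads), given everything seen so far."""
        return self.heads / (self.heads + self.tails)

belief = CoinBelief()
print(belief.prob_heads)                    # 0.5: the idealised approximation

belief.update(new_heads=62, new_tails=38)   # this particular coin, thrown by this particular person
print(round(belief.prob_heads, 3))          # the estimate drifts towards what the data says
```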
Now, this means that what is happening here at the Mathematical Institute is core, is essential, because we need to develop almost a new language, a new way of thinking, a new toolset, for this big data world. And you are doing it here. When we look at it, most of the tools that we have for making sense of data in the small data world, you know, like R-squared, and most of the software tools that we have, like those developed here at Oxford, were originally developed in a small data context. Now we need to find their equivalents in the big data world. And that is, as I understand it, what the Oxford-Nie Lab will provide us: cutting-edge research trying to help us enter this big data world with the right tools, so that we can make sense of it.

For that, of course, we need datafication, that is, to render ever more aspects of our world into data form. We have already done that with location, right? I still remember when you would take a car and drive into a new city, and somebody would sit next to you with a map on their lap. None of my students here in the audience would ever do that anymore. You did that, they ask me, didn't you have satnav? No. Location has been datafied, and therefore we can analyse it.

But it's not just that. In Japan, researchers have datafied human behinds: through sensors, they measured the size of the bum. Why? Never ask researchers that question. Why did they do that? Because they think that every person has a different bum. They discovered that the bum is as unique as a fingerprint. And so the idea is to use this as a car anti-theft device. You get into your car, your car measures your bum, you can drive off. The thief gets into the car, the bum is measured, the bum is way too big, the car stops in its tracks. That's the datafication of another aspect of our reality.

Now, you know, of course, this is Google Glass. In this version, Google Glass doesn't do what I think Google Glass is going to do, and where Google is investing in it, and that is to datafy the human gaze. What are we looking at? Can you imagine how valuable it is to know what people, what human beings, are looking at? What advertisements they are looking at, what they are looking at in the shop window, what men are looking at when they walk down the street? Well, skip the last one. We know that: cars.

Datafication, that is, rendering ever more aspects of our reality into data form, permits us to then extract value.
And if you take anything about the value proposition of big data away from this talk, please take away the fact that in the small data age we used data for the particular purpose for which it was collected, and then we threw it away. In the big data age, we understand that the value of data is not exhausted by using it once; we can reuse it and reuse it and reuse it for multiple purposes, extracting more and more value. Much like an iceberg, with data most of the latent value has been untapped so far. But we can tap into it through reuse.

For example, the global financial payment company SWIFT, which transfers money across borders: SWIFT discovered that it can use its data to predict the health of local economies, because of the correlations that it found. That's a reuse of the data that it has. Or the start-up company INRIX in Seattle, which helps over 100 million users every single working day to find their way to work or back with their car around traffic jams, creating heat maps like this, telling them where there is heavy traffic. Now, satnavs do that, right? But my satnav is stupid. My satnav tells me that there is heavy traffic when I'm already in it. This knows when heavy traffic is forming. Why? How? Because they have the data. How do they get the data? Because every one of their 100 million users is a sensor, sending back data on where they are and how fast they are going.

And then, you know what INRIX found out? That they can reuse the data. They teamed up with a hedge fund, because it turns out that there is a correlation between heavy traffic on weekends around shopping malls and the revenue numbers of the shops in the shopping malls. So they are buying or selling stocks before quarterly earnings results, based on these predictions. That, too, is a reuse of data.

And when you have so much reuse happening, you can rethink your business model. Take Rolls-Royce, not the luxury car company, but the jet engine producer, the world's number two jet engine producer. Rolls-Royce used to produce and sell jet engines. These jet engines have lots of sensors in there that measure vibration and temperature and pressure and so forth, and send it to a computer in the jet engine that manages the jet engine. And then the data are thrown away. With the Airbus A380 engine here, you see, they discovered that they can actually capture the data and then send it back to Rolls-Royce headquarters once the plane has landed. They do that. It's an enormous amount of data, a couple of gigabytes per plane, per flight.
But what do they do with it? They do an analysis to find patterns that show them when a part in the jet engine is breaking, before it actually breaks. So they can do what is called predictive maintenance, which is great, because you can do the maintenance before the part breaks, that is, when the plane is on the ground. That helps, but it also helped Rolls-Royce tremendously to change its business model around, from a company selling jet engines to a company selling fixed-fee maintenance contracts, going into the service sector. Today, 70% of their revenue is derived from services.

Many of you will now look at this and say: this means only that the big get bigger, the Googles, the Facebooks and so forth. And there is some truth to it, because the big ones are buying up data ingestion platforms. Remember that Google bought a thermostat company earlier this year, called Nest, for almost $3 billion. A thermostat company. Give me a break. A thermostat company? But Google didn't buy a thermostat company. Google bought a data ingestion platform that collects data about how cold or warm people want the rooms in their houses to be. And that was worth, in their mind, $3 billion.

So, in this context, we will have big data in the financial services industry to improve forecasting and decision making in the industry; we already see that happening. We will also have it stimulate innovation generally, because with big data there will be many businesses coming up with many new ideas. But it will also make the financial information sector itself into a data platform, which is precisely what Bill Nie and FDT have been doing. So it's not just that the big get bigger; there is a place for start-up companies, for vibrant new entrepreneurs.

Think of Decide.com, a company that predicts, for 50,000 consumer goods, whether the prices are going up or down. And they are so convinced that their prediction is right that if you buy the product based on their prediction and it's wrong, they will refund you the difference. This is the brainchild of Oren Etzioni, and at Decide.com they have hundreds of thousands of customers every day and billions of data points in order to do the calculation of the prediction for 50,000 consumer products. You know how many employees they have? 30, including the cleaning lady. How many servers? Zero, because they do it all in the cloud. And that means that there isn't a huge investment necessary anymore to start this up. So we'll see a lot of activity on the small end of the spectrum as well, with nimble companies being extremely successful.
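The predictive-maintenance idea from the Rolls-Royce example above can be sketched very simply: learn what "normal" sensor behaviour looks like from past flights, then flag engines whose recent readings drift away from that baseline, so the part can be inspected on the ground before it breaks. The data, threshold, and function below are made-up illustrations of the general technique, not Rolls-Royce's actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vibration levels recorded on 200 past flights of a healthy engine.
healthy = rng.normal(loc=1.0, scale=0.05, size=200)
baseline_mean, baseline_std = healthy.mean(), healthy.std()

def flag_for_inspection(recent_flights: np.ndarray, threshold: float = 3.0) -> bool:
    """Flag the engine if its average recent vibration sits far outside the healthy baseline."""
    z = (recent_flights.mean() - baseline_mean) / baseline_std
    return abs(z) > threshold

print(flag_for_inspection(rng.normal(1.0, 0.05, size=5)))   # False: looks like every other flight
print(flag_for_inspection(rng.normal(1.4, 0.05, size=5)))   # True: drifting, schedule maintenance
```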
318 00:37:37,920 --> 00:37:43,080 Now, let me briefly tell you that this also changes how we internally operate. 319 00:37:45,670 --> 00:37:50,290 In organisations. I'm not talking about 50 Shades of Grey. 320 00:37:50,710 --> 00:37:59,740 I'm talking about 41 Shades of Blue. That is a few years back, Google had to decide what kind of colour to use for its Google search box. 321 00:38:01,510 --> 00:38:06,820 And the designer used a particular colour and his supervisor said, Why does colour? 322 00:38:06,860 --> 00:38:11,530 And he said, Because I'm the designer. And she said, Did you do a test? 323 00:38:11,530 --> 00:38:15,610 And he said, No. If you want to do me a test, I will resign. She said, Resignation accepted. 324 00:38:17,410 --> 00:38:23,530 And then they did a test. And then they found out testing 41 different shades of blue, that there was a slightly different shades of blue. 325 00:38:24,530 --> 00:38:31,100 That worked better. And gave Google $12 million more in annual revenues ad revenues per year. 326 00:38:32,580 --> 00:38:38,639 So the person who fired that, chief designer, Marissa mayer, now chief of Yahoo! 327 00:38:38,640 --> 00:38:49,050 Said this was the best business decision she ever made. It means that self-styled experts will be questioning. 328 00:38:50,850 --> 00:38:54,450 What's your email? What's the factual basis? 329 00:38:54,750 --> 00:38:59,670 They will be asked for your assumptions, for your suggestions. 330 00:39:01,270 --> 00:39:06,730 You know, and that especially is valid in the financial services industry. 331 00:39:07,180 --> 00:39:17,020 Companies like Imaginative in the Silicon Valley are looking at creating interesting trading platforms for specialised products. 332 00:39:17,020 --> 00:39:24,370 But then there is FTT, which creates with its big data set, 333 00:39:24,370 --> 00:39:30,880 not just a platform for people to learn, but for those to teach to people what they are learning. 334 00:39:31,300 --> 00:39:40,760 Learn as well, learn about the learning, and therefore create an environment that keeps people engaged and interested. 335 00:39:40,850 --> 00:39:45,960 Just amazing. But then I'm sure you have also heard of Target. 336 00:39:47,150 --> 00:39:54,530 This superstore in the United States that was able to predict with a relatively good degree of likelihood. 337 00:39:56,080 --> 00:40:02,320 Based on transactions, shopping transactions that one of their customers was pregnant. 338 00:40:03,650 --> 00:40:09,290 Even though that customer might not herself know if the pregnancy and. 339 00:40:11,590 --> 00:40:20,970 Oh. And so there is a dark side of big data, too, and we need to mention that. 340 00:40:22,160 --> 00:40:26,330 Many of you now think of George Orwell's 1984 and Surveillance Society. 341 00:40:26,330 --> 00:40:31,680 Yes, yes, yes, yes. Snowden, I get it. But this is only half the problem. 342 00:40:31,740 --> 00:40:40,410 The other half is that with predictions, we are also able perhaps to predict human behaviour. 343 00:40:41,380 --> 00:40:49,630 And as we predict human behaviour, we might be tempted to hold people responsible not for what they have done but what they are only predicted to do. 344 00:40:50,440 --> 00:40:56,170 Now if you think of the Hollywood movie Minority Report, that's precisely what I'm thinking of, too. 345 00:40:56,650 --> 00:41:04,570 In 30 states, in the United States, the decision of whether or not you are coming free out of prison on a parole. 
is being made in part by a big data algorithm that predicts whether or not you're going to be a criminal in the next 12 months. Is this going to be the end of free will, because everything about human behaviour will be predicted?

We need to be very aware of those ethical dilemmas, but we need to also keep in mind that the problem here is not big data. The problem is how we use the results of big data. Correlations are not causality. Using correlational insights for causal purposes, for example to assign responsibility, is abusing big data. It's stupid, it's wrong. You know, in the United States they did a big data analysis to find out what kind of car has the fewest repairs, and it turns out it's a car of the colour orange. Half of you are already thinking: why? Right? Is it because an orange car is more visible at night? Is it because the owner of an orange car drives more carefully, because it's a special car? Is it because it was manufactured specially? Time out, guys and gals, time out. The data doesn't tell us. The moment you start imbuing data with more meaning than it has, that moment you succumb to the dictatorship of data. And we need to be aware of that.

So let me come to a conclusion. Big data is going to change how we make decisions, how we live, work and think, from how we learn, to what kind of medications we are getting, to how cars drive themselves. But big data also has a number of challenges. So what is really important, as we enter this big data age, is that we remain fully in control of that technology, aware of its constraints and its limitations. That, as much as we will utilise the insights from big data, we also protect and preserve a space for the human: for creativity, originality, for irrationality, for sometimes acting in defiance of what the data says. Because at the end of the day, data is just a shadow of reality, and therefore it is always incomplete, and always a little bit incorrect. And so we need to approach this new big data world with a lot of humility rather than hubris. And we need to do so with a lot of humanity.

Thanks very much.