1 00:00:06,890 --> 00:00:10,720 Dear all, my name is Therese Hopfenbeck. 2 00:00:10,790 --> 00:00:19,820 I'm the Director of the Oxford University Centre for Educational Assessment, and it gives me real pleasure to welcome you all here today. 3 00:00:20,720 --> 00:00:28,310 I would particularly like to welcome Professor Art Graesser from the US and Professor Sandra Milligan from Australia. 4 00:00:28,940 --> 00:00:36,980 We're extremely glad to have you here. And we also have our own Professor Josh McGrane, whom we're really proud to have as one of the speakers. 5 00:00:37,580 --> 00:00:40,630 I will say more about all of you when I introduce you, 6 00:00:41,480 --> 00:00:50,330 but I will say a particular welcome to all our guests from different universities, nationally and also internationally, 7 00:00:50,840 --> 00:00:58,430 because we have Professor Nils Gilje and Professor Sølvi Lillejord, who flew in from Norway just to be part of this celebration. 8 00:00:58,430 --> 00:01:05,120 And Sølvi has been a longstanding supporter of the centre and also a Fellow of our Department of Education. 9 00:01:05,120 --> 00:01:07,100 So we're really, really happy that you're here. 10 00:01:08,240 --> 00:01:18,170 We would also particularly like to say we're very pleased to have the President of Kellogg College, Jonathan Michie. Welcome. 11 00:01:18,260 --> 00:01:25,340 We're really pleased that you're here. And we also would like to say a special welcome to all our DPhil students and master's students, 12 00:01:25,700 --> 00:01:36,080 and recognised students, like Sarah from Australia, from OUCEA in our department, because you are what makes Oxford a very special place to be.
13 00:01:36,620 --> 00:01:41,570 And we are pleased to be offering face-to-face events, although a bit hot, 14 00:01:42,620 --> 00:01:47,479 because after the pandemic it's been a lot online, and I really hope that you will have 15 00:01:47,480 --> 00:01:52,790 time now in Oxford to thrive and meet each other in person and experience that too. 16 00:01:54,420 --> 00:02:00,390 It is a special welcome this year, as we are now celebrating the first OUCEA Annual Lecture since 2019, 17 00:02:01,290 --> 00:02:07,710 before the pandemic. Our centre, founded in 2008 with generous funding from Pearson, 18 00:02:08,100 --> 00:02:15,900 has been a small but thriving centre, and each year we have had the pleasure of hosting some of the world's leading professors to give a talk. 19 00:02:16,710 --> 00:02:25,320 Previous lectures have included David Andrich from Australia, Gordon Stobart, UCL, UK, Pam Sammons, Oxford, UK, Nancy Perry, 20 00:02:25,350 --> 00:02:32,250 British Columbia, Canada, David Pearson, Berkeley, US, and Derek Briggs from Colorado Boulder, US. 21 00:02:33,600 --> 00:02:37,700 Little did we know that we would have to wait two years to celebrate again. 22 00:02:38,810 --> 00:02:45,200 Therefore, we will allow ourselves to celebrate in appreciation of the funders who have believed in us through the pandemic. 23 00:02:45,710 --> 00:02:53,570 And we celebrate our national and international collaborators who stood by us and achieved so much with us in the past challenging years. 24 00:02:54,590 --> 00:03:01,610 I would like to thank you all. The pandemic and lockdown have been hard in different ways for all of us. 25 00:03:02,240 --> 00:03:07,520 And living with uncertainty in academia was a challenge we had to tackle. 26 00:03:08,720 --> 00:03:13,580 I'm proud to say we were able to get through it, although many will say we're still in it.
27 00:03:14,510 --> 00:03:19,310 And I would like to take a few moments to particularly thank the team at OUCEA for their commitment, 28 00:03:19,670 --> 00:03:26,600 dedication, good spirit and positivity, which led us to successfully secure grants, 29 00:03:26,870 --> 00:03:34,490 collect data, write research reports, publish articles, and present at different international conferences. 30 00:03:36,110 --> 00:03:42,440 I will use as an example that here today we have Dr. Samantha-Kaye Johnston present. 31 00:03:42,830 --> 00:03:48,260 She worked hard online for more than a year before we actually met in person. 32 00:03:48,410 --> 00:03:51,950 And she represents this kind of team spirit which I'm talking about. 33 00:03:53,460 --> 00:04:03,090 A particular thank you and appreciation go to Deputy Director Josh McGrane, who led OUCEA together with me during these challenging times. 34 00:04:03,960 --> 00:04:12,000 We have been able to find a way of working where collaboration, and making our team succeed no matter what, was the key to all we did. 35 00:04:12,810 --> 00:04:16,890 And you used the words 'leading with integrity and compassion', Josh. 36 00:04:17,850 --> 00:04:22,350 And that is what we do and will continue to do. 37 00:04:22,890 --> 00:04:27,170 And thank you for that. These are values I strongly believe in, 38 00:04:27,180 --> 00:04:32,100 and also one reason why Kellogg College became such an important part of our academic life. 39 00:04:32,880 --> 00:04:37,020 We are very thankful to have Kellogg President Jonathan Michie here today celebrating with us. 40 00:04:37,470 --> 00:04:43,050 Jonathan is now a Fellow of the Department of Education here in Oxford and an important collaborator for us, 41 00:04:43,350 --> 00:04:51,240 as both Josh and I are members of the college and have attended seminars and events to support its mission on sustainability and education. 42 00:04:51,960 --> 00:04:58,420 You can hardly think of anything more important.
And we will continue to support the college in any way we can. 43 00:04:59,190 --> 00:05:07,640 And for those of you visiting Oxford, if you do have time, walk to Kellogg and see the wonderful gardens, take a cup of tea or something in the Hub. 44 00:05:07,650 --> 00:05:14,730 It's beautiful there. But for now, let me introduce our distinguished guests. 45 00:05:14,760 --> 00:05:20,489 Art Graesser is a Professor Emeritus in the Department of Psychology and the Institute for Intelligent 46 00:05:20,490 --> 00:05:25,050 Systems at the University of Memphis and an Honorary Research Fellow at the University of Oxford. 47 00:05:25,830 --> 00:05:30,420 His research is in discourse processing, cognitive science and education. 48 00:05:30,720 --> 00:05:37,050 And personally, I had the pleasure of meeting you for the first time in Melbourne in 2010, when we were both working on PISA. 49 00:05:38,100 --> 00:05:41,370 He has developed software in learning, language and discourse technologies, 50 00:05:41,370 --> 00:05:46,550 including systems that hold a conversation in natural language with computer agents, AutoTutor, 51 00:05:46,570 --> 00:05:51,990 and that analyse text on multiple levels of language and discourse, Coh-Metrix. 52 00:05:52,830 --> 00:05:59,280 He served as editor of Discourse Processes and the Journal of Educational Psychology, as president of the Society for Text and Discourse, 53 00:05:59,280 --> 00:06:06,540 the International Society for Artificial Intelligence in Education and the Federation of Associations in Behavioral and Brain Sciences, 54 00:06:06,840 --> 00:06:13,290 and as a member of four panels with the National Academy of Sciences and four OECD expert panels on problem solving: 55 00:06:13,620 --> 00:06:18,140 PIAAC 2011-21, PISA 2012 and 2015.
56 00:06:18,840 --> 00:06:22,950 He has received lifetime achievement awards from the American Psychological Association, 57 00:06:22,950 --> 00:06:32,240 McGraw-Hill Education, the Society for Artificial Intelligence in Education, the Society for Text and Discourse, and the University of Memphis. Art, 58 00:06:33,030 --> 00:06:43,050 the floor is yours. And we're so happy to have you. So, 100-degree heat, Fahrenheit, right? 59 00:06:44,900 --> 00:06:51,260 Well, I come from Memphis, and we've had 100-degree heat now for six weeks. 60 00:06:52,190 --> 00:07:01,430 And so I'm used to it. If you need any, you know, sort of solutions on how to survive it, just ask me during the break. 61 00:07:03,780 --> 00:07:09,330 So I retired three years ago, September of 2019. 62 00:07:11,790 --> 00:07:19,290 September 2019. A month later, I found out my wife had endometrial cancer. 63 00:07:21,760 --> 00:07:32,280 About a couple of months after that, COVID came. The next city I was supposed to go to was a place called Wuhan, China. 64 00:07:33,180 --> 00:07:43,440 Have you ever heard of that? I couldn't go because of my wife's cancer, so I keep telling my wife her cancer saved me. 65 00:07:45,570 --> 00:07:52,020 So, some people think I'm never really going to retire. 66 00:07:52,050 --> 00:07:56,090 And in fact, I'm at a phase called transitioning. 67 00:07:58,200 --> 00:08:07,650 And so, during the transition, what I've been doing is participating in a number of large-scale projects. 68 00:08:08,490 --> 00:08:12,120 I'm not leading any of them. I'm just a team member. 69 00:08:12,720 --> 00:08:19,770 But at this phase in my life, I want to help these large-scale projects succeed. 70 00:08:20,340 --> 00:08:21,930 And what I want to do is just give you 71 00:08:23,240 --> 00:08:33,980 very succinct highlights on some of these. All of the work is current, so I'm not going to delve into the past. 72 00:08:35,120 --> 00:08:40,360 There are three important themes. Technology.
73 00:08:40,600 --> 00:08:44,260 Yes. Multiple disciplines. 74 00:08:44,710 --> 00:08:53,620 Not only being interdisciplinary, but inter-institutional and intercontinental, and not quite interstellar yet. 75 00:08:55,960 --> 00:09:04,970 And also 21st century skills, in addition to the maths and the literacy and science, 76 00:09:05,530 --> 00:09:10,720 because 21st century skills, as we know, are an important part of the workforce. 77 00:09:11,560 --> 00:09:16,810 And so we need to change our curriculum to accommodate that. 78 00:09:17,080 --> 00:09:20,620 And assessment is one way to drive that. 79 00:09:22,630 --> 00:09:29,350 This is a recent article I have in the Annual Review of Psychology. 80 00:09:31,090 --> 00:09:41,620 I was asked to cover educational psychology in 25 pages, so I decided to focus on this particular slant. 81 00:09:43,410 --> 00:09:51,470 So this is an overview, organised really around the funders. 82 00:09:53,390 --> 00:09:59,930 The first is the OECD projects. I assume everybody knows what PISA and PIAAC are. 83 00:10:00,350 --> 00:10:05,870 Does anybody not know what PISA is? Raise your hand if you don't know what PISA and PIAAC are. 84 00:10:08,450 --> 00:10:11,810 Well, we'll talk later. Yeah, right. So. 85 00:10:15,580 --> 00:10:24,960 Then I'm going to talk about funding for adult learning out of the Institute of Education Sciences of the US Department of Education. 86 00:10:24,970 --> 00:10:33,210 That's a major research arm of the US Department of Education, and it's all about adult learning. 87 00:10:33,280 --> 00:10:36,910 I know for decades they concentrated on K-12, 88 00:10:37,540 --> 00:10:46,899 but it's only recently that they're branching out to adult learning, because there are a lot 89 00:10:46,900 --> 00:10:55,570 of struggling adults who can't keep pace with the skills that are needed in the 21st century. 90 00:10:57,540 --> 00:11:00,779 Then a couple of things from the Department of Defence.
91 00:11:00,780 --> 00:11:05,819 The Department of Defence has really led a lot of the work on learning environments and 92 00:11:05,820 --> 00:11:11,670 assessment in those environments, because they deal with adults in the military. 93 00:11:11,910 --> 00:11:15,120 And so they've funded a large number of projects. 94 00:11:16,320 --> 00:11:23,310 And then finally, I want to talk a little bit about the NSF-funded Learning Data Institute. 95 00:11:25,250 --> 00:11:28,760 So let me start with the OECD projects. 96 00:11:30,250 --> 00:11:33,910 As you know, they have these assessments throughout the world. 97 00:11:34,150 --> 00:11:38,620 You have a number of countries who consistently buy into this. 98 00:11:39,070 --> 00:11:42,430 Others may buy in some years, but not others. 99 00:11:43,030 --> 00:11:46,390 And the countries have to pay for it. 100 00:11:48,580 --> 00:11:55,870 The whole concept is, of course, that the OECD is really interested in the economies of these countries. 101 00:11:56,410 --> 00:12:03,370 And education is a big part of predicting the success of economies. 102 00:12:04,030 --> 00:12:15,530 And assessment, hopefully, will drive the curriculum to improve the education of the people in these countries. 103 00:12:16,550 --> 00:12:22,910 As you know, there are these comparisons amongst countries, but you're not supposed to worry about that. 104 00:12:24,500 --> 00:12:32,180 Instead, it's supposed to be driving countries over time in their planning on education and curriculum. 105 00:12:35,450 --> 00:12:44,960 The project I've recently been involved in is the adaptive problem solving assessment for PIAAC 2021. 106 00:12:46,320 --> 00:12:54,030 Actually, the data are being collected in 2022, because of COVID. 107 00:12:57,540 --> 00:13:03,900 The key thing with adaptive problem solving is that the problems are not static.
108 00:13:05,440 --> 00:13:11,290 Traditionally in problem solving, you have a set of givens and then a goal state, 109 00:13:11,560 --> 00:13:18,490 and then you try to figure out how to solve the problem to get you from the givens state to the goal state. 110 00:13:18,730 --> 00:13:25,000 Right. Well, in adaptive problem solving, the problems change in the middle. 111 00:13:27,040 --> 00:13:32,380 And so you have to change your strategy. So strategy change is an important part. 112 00:13:34,220 --> 00:13:41,390 You have to have better metacognition in order to track what you know and how you deal with it. 113 00:13:42,920 --> 00:13:46,370 And so metacognition is a big part of the assessment. 114 00:13:49,190 --> 00:13:53,360 So they're in the middle of collecting data on this. 115 00:13:54,170 --> 00:14:06,950 So I can't really show you any data. But Samuel Greiff at Luxembourg, he and his group are leading the charge. 116 00:14:07,340 --> 00:14:12,560 And then Educational Testing Service is playing the big role in collecting the data. 117 00:14:14,140 --> 00:14:19,990 And so maybe at some point I can talk a little bit about that in the future. 118 00:14:21,520 --> 00:14:24,820 The other thing I wanted to mention is 119 00:14:28,000 --> 00:14:35,930 a couple of projects led by Stuart Elliot. 120 00:14:37,000 --> 00:14:49,570 He's been part of the OECD. He's also affiliated with the National Academies of Sciences, Engineering, and Medicine in the United States. 121 00:14:50,030 --> 00:14:53,620 So what are these reports about? 122 00:14:54,610 --> 00:15:03,460 Well, everybody's worried about people not having the right skills in the future. 123 00:15:04,600 --> 00:15:10,960 And they're also worried about AI and robotics taking over jobs.
124 00:15:12,220 --> 00:15:20,740 And so one of the projects that I participated in had a bunch of experts, 125 00:15:20,740 --> 00:15:38,470 11 of us, all knowledgeable about AI and robotics, consider a computer taking the tests on literacy, numeracy and problem solving, 126 00:15:39,190 --> 00:15:43,000 and imagine how well the computers would do. 127 00:15:44,360 --> 00:15:54,620 And so each of us, you know, tried to make judgements on all the items and justify our decisions. 128 00:15:55,670 --> 00:16:02,780 And you might have a problem like this windmill problem, which is a 129 00:16:02,780 --> 00:16:08,360 problem asking how many windmills would be needed to replace the nuclear reactor. 130 00:16:08,810 --> 00:16:12,230 And so you have to read the text and understand it. 131 00:16:12,470 --> 00:16:16,130 You have to sometimes look at pictures and interpret them. 132 00:16:16,430 --> 00:16:24,319 You have to understand a table of data that's down below, and you have to integrate all that 133 00:16:24,320 --> 00:16:28,010 in order to interpret the question and answer the question. 134 00:16:28,730 --> 00:16:33,560 And so you had 11 experts do this, and they analysed the data. 135 00:16:34,490 --> 00:16:40,490 They divided the experts into the optimists, the realists and the pessimists. 136 00:16:42,020 --> 00:16:45,680 Well, I was categorised as a pessimist. 137 00:16:46,910 --> 00:16:58,010 Now, we recently redid this, about four months ago, and the same experts and a few more came and analysed the data. 138 00:16:58,010 --> 00:17:01,070 And I'm now a realist. 139 00:17:01,520 --> 00:17:09,620 So I guess there's been progress: either I'm understanding AI better, or AI has changed. 140 00:17:09,980 --> 00:17:14,090 But it's an interesting project, and you can read about it. 141 00:17:15,810 --> 00:17:20,580 The other project that Stuart Elliot is leading is.
142 00:17:22,300 --> 00:17:29,560 Trying to figure out what skills matter. Well, first of all, what is a good taxonomy of skills? 143 00:17:30,670 --> 00:17:38,530 And so he got together people in psychology, psychometrics, people certifying people for jobs. 144 00:17:40,630 --> 00:17:44,470 He also got people in AI and human factors. 145 00:17:44,650 --> 00:17:47,140 They got an ensemble of people. 146 00:17:47,650 --> 00:17:58,630 I was fortunate to be asked to be one of them, and there's a book that reflects on what they think a good taxonomy of skills is. 147 00:17:59,500 --> 00:18:12,910 And in addition to identifying that taxonomy and the skills, they are making judgements on what computers can do versus people. 148 00:18:13,800 --> 00:18:17,340 What should people be doing, as opposed to computers? 149 00:18:18,120 --> 00:18:27,060 And these are judgements by experts. No real hard data, but kind of interesting, and a lot of interesting perspectives. 150 00:18:27,510 --> 00:18:30,899 You can download these from the OECD website. 151 00:18:30,900 --> 00:18:37,470 As you know, the website has, I guess, thousands of documents that you can download. 152 00:18:37,650 --> 00:18:40,650 Some of them you get for free, others you have to pay for now. 153 00:18:40,650 --> 00:18:44,310 But it's fascinating, because 154 00:18:45,460 --> 00:18:55,690 many people have argued that in the future we want to better understand what the computer should do versus what humans should do. 155 00:18:56,870 --> 00:19:00,230 And there are a lot of counterintuitive findings. 156 00:19:02,480 --> 00:19:07,190 And exploring that is important. 157 00:19:07,430 --> 00:19:09,340 And this is really the first volume. 158 00:19:09,390 --> 00:19:15,590 They're going to have other volumes in the future, because this is a pressing question that people have. 159 00:19:15,860 --> 00:19:19,280 A lot of people are terrified that they're going to lose their jobs.
160 00:19:19,640 --> 00:19:30,530 And as you know, there's been a shift towards more jobs requiring reasoning and problem solving and collaboration, collaborative problem solving. 161 00:19:30,830 --> 00:19:36,920 I know Sandra and I were just talking about collaborative problem solving, and I was involved in that in PISA. 162 00:19:38,370 --> 00:19:44,810 So what do you want the computer to do, as opposed to humans? 163 00:19:46,470 --> 00:19:51,330 So this is just another problem. 164 00:19:52,720 --> 00:19:57,130 And, you know, you look at all the information here. 165 00:19:57,580 --> 00:20:03,400 A computer would have to identify which columns to get the information from. 166 00:20:03,850 --> 00:20:08,860 It's asking which muscles would benefit most if you use the gym bench. 167 00:20:09,370 --> 00:20:18,180 And so you have to look under muscles, and then you have to drill down on the right cell in order to answer the question. 168 00:20:18,190 --> 00:20:21,970 And there's a lot of distracting information, and you have to handle that. 169 00:20:24,210 --> 00:20:33,390 Okay. Let me go on to the second part, with the projects with the Institute of Education Sciences and adult learning. 170 00:20:35,010 --> 00:20:43,920 I was fortunate to be part of a large centre grant, the Centre for the Study of Adult Literacy. 171 00:20:44,760 --> 00:20:49,080 That was led by Daphne Greenberg at Georgia State University. 172 00:20:49,110 --> 00:20:57,630 She's really one of the gurus who've been trying to help improve adult literacy throughout her career. 173 00:20:59,160 --> 00:21:05,250 Those are the Co-PIs, and that project was a five-year project. 174 00:21:05,280 --> 00:21:09,150 It's now continuing under the direction of John Sabatini. 175 00:21:09,810 --> 00:21:14,040 We're fortunate that John Sabatini 176 00:21:14,490 --> 00:21:20,910 joined us at the University of Memphis.
He used to be at Educational Testing Service, and in fact, 177 00:21:20,910 --> 00:21:35,040 he was the PI on the assessment of comprehension skills on a very large $100 million investment from the Institute of Education Sciences. 178 00:21:35,790 --> 00:21:37,830 And so he knows a lot about assessment. 179 00:21:38,100 --> 00:21:48,180 But he wanted to break away from ETS in order to do more research on learning and combining learning with assessment. 180 00:21:49,240 --> 00:21:56,709 Both of us believe that you can't really track learning without good assessment, and you 181 00:21:56,710 --> 00:22:03,400 can't just have assessment without considering learning. And I see people nodding; 182 00:22:03,400 --> 00:22:10,270 I think you all agree. And so, fortunately, he came to the University of Memphis. 183 00:22:12,510 --> 00:22:19,360 So we had an intervention, and it was a hybrid intervention to treat 184 00:22:22,350 --> 00:22:29,850 reading skills, and especially comprehension skills, because 40 to 60 million people, 185 00:22:30,120 --> 00:22:35,790 adults in the United States, don't read at a deep enough level to get a decent job. 186 00:22:36,690 --> 00:22:45,430 They don't read at an eighth-grade level, for example. And in fact, we've collected data in colleges with psychometric tests. 187 00:22:45,840 --> 00:22:52,860 And 38% of the students in college don't read at an eighth-grade level. 188 00:22:53,640 --> 00:22:56,640 And so, you know, it's a problem. 189 00:22:56,640 --> 00:23:01,770 And that squares, I think, with the PIAAC and PISA data, too. 190 00:23:03,120 --> 00:23:11,760 Now, it's hard to train teachers how to teach comprehension skills. 191 00:23:12,990 --> 00:23:18,690 They do very well on decoding and vocabulary, but when it comes to comprehension, 192 00:23:18,690 --> 00:23:28,680 it's very variable and too often not science-based, without even tangible ways to teach comprehension.
193 00:23:29,160 --> 00:23:35,760 It's hard. So what we did is build a system, AutoTutor, with these agents. 194 00:23:36,390 --> 00:23:46,830 And so imagine people reading text on a computer and having these agents having conversations about the text with the person. 195 00:23:47,550 --> 00:23:57,720 And imagine periodically asking conversation-based questions, just asking a question and seeing how they answer. 196 00:23:59,520 --> 00:24:03,059 Well, that's what we did with AutoTutor. Now, 197 00:24:03,060 --> 00:24:14,610 it was not natural language input, because writing is hard and typing is hard, and often they don't have the digital literacy skills. 198 00:24:15,450 --> 00:24:28,409 And in fact, both John Sabatini and I are convinced you have to combine digital skill acquisition with reading comprehension. 199 00:24:28,410 --> 00:24:39,330 Both have to go hand in hand. One challenge is, if you look at digital skill acquisition on the computer, 200 00:24:40,820 --> 00:24:47,440 nearly all of it requires you to know how to read. Well, that's the problem you're trying to solve. 201 00:24:47,450 --> 00:24:51,740 So it's a real chicken-and-egg problem, a real challenge. 202 00:24:51,950 --> 00:25:00,460 And so we're now building digital skills training for people where you don't have to read much, you know? 203 00:25:03,410 --> 00:25:08,180 So they have these agents, and we created a lot of lessons. 204 00:25:08,780 --> 00:25:12,139 Actually, we have 30 lessons. 205 00:25:12,140 --> 00:25:24,590 Some of them are on words and sentences, like how to interpret pronouns or non-literal language, or learning new words. 206 00:25:25,280 --> 00:25:33,220 Then you have others on stories and texts, in many different kinds of genres and subgenres. 207 00:25:33,230 --> 00:25:45,590 You get narrative and expository and persuasive texts, and compare-and-contrast, problem-solution, and each of these modules.
208 00:25:46,040 --> 00:25:50,380 It takes 20 minutes to an hour to complete. 209 00:25:50,900 --> 00:25:56,300 And so it's really a 20-to-30-hour intervention that they have. 210 00:25:57,140 --> 00:26:06,379 And so we built this system, and we were hoping not only to help the students learn better, 211 00:26:06,380 --> 00:26:15,770 but also instructors, because they're often not exposed to training for comprehension skills. 212 00:26:16,100 --> 00:26:23,090 Remember, in literacy centres, a lot of these people are volunteers from the community. 213 00:26:24,220 --> 00:26:35,260 And in the past, until recently, there was not a lot of investment from the US government to train adults in reading comprehension skills. 214 00:26:35,740 --> 00:26:43,780 And so both instructors can learn, and also struggling adult readers can learn. 215 00:26:44,670 --> 00:26:55,540 So this was a hybrid intervention, with human instructors as well as AutoTutor on the computer. 216 00:26:56,490 --> 00:27:01,020 However, our part was the AutoTutor part. 217 00:27:02,220 --> 00:27:05,160 AutoTutor can store all these data in the cloud. 218 00:27:06,480 --> 00:27:18,180 Whereas with humans, you don't really know what they're doing unless you had a very large budget to videotape them and transcribe it and analyse it. 219 00:27:18,420 --> 00:27:25,380 That would take forever, of course, whereas with the computer, it's all in the cloud; you can immediately see it. 220 00:27:25,770 --> 00:27:27,659 And there are many things you can collect. 221 00:27:27,660 --> 00:27:37,080 You can collect reading time, how long it takes to read the text; accuracy in answering these conversation-based questions; 222 00:27:37,320 --> 00:27:47,400 the time it takes to answer the question; any learner initiative in asking questions or choosing lessons to take. 223 00:27:49,350 --> 00:27:58,050 There are also measures of the students.
You can collect background measures, and it's all stored there in the database, in the cloud. 224 00:27:59,250 --> 00:28:09,930 We have psychometric tests, three of them, on comprehension skills and other skills, and we analysed the lessons. 225 00:28:09,930 --> 00:28:17,220 We used this Coh-Metrix system to scale texts on difficulty and many levels of language. 226 00:28:19,140 --> 00:28:24,740 And so, lots of data. 227 00:28:25,190 --> 00:28:34,190 And so we've been mining it. In fact, next week I go to Educational Data Mining and Artificial Intelligence in Education. 228 00:28:34,200 --> 00:28:38,300 It's held in Durham. It's the first time I'll ever have been in Durham. 229 00:28:38,810 --> 00:28:42,740 And so we'll be presenting some of our stuff there. 230 00:28:45,220 --> 00:28:54,880 We did a study with about 253 people; these came out of Toronto and Atlanta, 231 00:28:55,690 --> 00:29:01,750 and these were struggling adult readers, below the eighth-grade level. 232 00:29:02,440 --> 00:29:06,250 Typically, they read between the third- and fourth-grade level. 233 00:29:06,820 --> 00:29:19,030 Okay. So we analysed the data from AutoTutor, looking at the pattern 234 00:29:19,030 --> 00:29:26,620 of reading times and the time to answer questions and things like that. 235 00:29:27,670 --> 00:29:33,250 We did some clustering analysis, and we found four types. 236 00:29:33,490 --> 00:29:41,470 You have the higher performers. These are the ones that answer questions relatively quickly and are pretty correct. 237 00:29:41,950 --> 00:29:51,390 And then you have the conscientious, and they're the real beneficiaries, looking at effect sizes on post-test minus pre-test. 238 00:29:52,900 --> 00:29:56,080 So you really want to leave them alone to do whatever they're doing. 239 00:29:56,710 --> 00:30:02,140 Then you have struggling readers. It's beyond the zone of proximal development for them.
240 00:30:02,650 --> 00:30:06,600 And our intervention was not working for them at all. 241 00:30:06,690 --> 00:30:10,960 You've got to do something different. You've got to scrap it and do something different. 242 00:30:11,380 --> 00:30:18,220 And then the under-engaged. Some people try to do things quickly, and so they underachieve. 243 00:30:18,220 --> 00:30:23,200 And so you might ask them to have a deeper read, things like that. 244 00:30:24,330 --> 00:30:35,670 So what we're trying to do is mine the data now so we can have a recommender system, in order to guide the learner on what to do next. 245 00:30:38,040 --> 00:30:42,070 Let me shift now to the Department of Defence projects. 246 00:30:44,370 --> 00:30:49,259 Once again, the DOD, the 247 00:30:49,260 --> 00:31:00,000 US Department of Defence, has taken the lead over the last 30 years in building these intelligent tutoring systems 248 00:31:01,680 --> 00:31:07,350 and other advanced learning environments, including virtual reality and augmented reality, 249 00:31:07,890 --> 00:31:17,550 and concentrating on adults. And so we've been involved in projects with them over the last 30 years. 250 00:31:18,450 --> 00:31:21,540 Let me just mention a couple of them. 251 00:31:21,900 --> 00:31:33,540 One is, during the last decade, we have been developing this Generalized Intelligent Framework for Tutoring. 252 00:31:34,320 --> 00:31:38,440 It's called GIFT, for short. 253 00:31:39,250 --> 00:31:45,910 And each year we gather a bunch of experts on some aspect of building these systems. 254 00:31:46,720 --> 00:31:54,940 And we get together, there's a presentation, and we write a book, and you can get the book for free. 255 00:31:55,840 --> 00:31:59,110 If you go to gifttutoring.org 256 00:32:01,120 --> 00:32:12,720 And it's really a good snapshot if you want to find out what the latest is on intelligent tutoring systems, and that involves team training, too.
257 00:32:14,350 --> 00:32:15,820 You can go to this site. 258 00:32:16,180 --> 00:32:29,770 Over 300 experts throughout the world have contributed to the GIFT expert workshops, and they represent all sorts of advanced learning environments, 259 00:32:29,770 --> 00:32:36,760 not just intelligent tutoring systems. And so the first year, their focus was on learner modelling. 260 00:32:37,780 --> 00:32:46,719 Then on instructional strategies, then on authoring tools, then domain knowledge, then assessment. 261 00:32:46,720 --> 00:32:53,770 The assessment one was held at Educational Testing Service. Then one on teams. 262 00:32:54,580 --> 00:33:02,350 We had people involved in collaborative problem solving there. Then self-improving systems. 263 00:33:02,350 --> 00:33:14,230 These are systems where the computer improves during the evolution of building a system, as well as learners improving and educators improving. 264 00:33:14,920 --> 00:33:27,040 It's kind of a co-evolving learning on all parts. Then data visualisation, and recently competency-based scenario design. 265 00:33:27,040 --> 00:33:28,869 So you're free to go there. 266 00:33:28,870 --> 00:33:42,220 This was a team-up between the University of Memphis and the Army, and so we're really happy with that community of people building the systems. 267 00:33:42,670 --> 00:33:48,370 Traditionally, these systems have been expensive to build, very expensive. 268 00:33:48,370 --> 00:33:55,810 So part of the goal is to make them able to be created faster, cheaper, things like that. 269 00:33:58,410 --> 00:34:01,440 Now you might ask, why did they team with Memphis? 270 00:34:01,980 --> 00:34:10,920 That's because we're known in our Institute for Intelligent Systems for building a lot of these systems with intelligent agents, 271 00:34:11,430 --> 00:34:17,009 these conversational agents. Some of them hold conversations in natural language.
272 00:34:17,010 --> 00:34:25,590 So you try to interpret what people say in natural language and respond to get them to productively learn. 273 00:34:26,310 --> 00:34:37,560 Now, I will say this: the computer does not perfectly understand the human. Most humans don't perfectly understand other humans. 274 00:34:38,040 --> 00:34:41,940 So you want to compare them on that basis. 275 00:34:42,030 --> 00:34:46,650 But the agents can have dialogue moves to help people learn. 276 00:34:46,830 --> 00:34:47,700 That's the idea. 277 00:34:49,800 --> 00:34:58,350 I'm going to talk about ElectronixTutor briefly, but the other one we like is the Personal Assistant for Lifelong Learning. 278 00:34:58,860 --> 00:35:05,870 Imagine if you had an agent that followed you throughout your career and collected all this information. 279 00:35:05,880 --> 00:35:10,260 It's almost like a learning portfolio, a digital learning portfolio. 280 00:35:10,680 --> 00:35:14,220 Imagine that. And then it comes up with recommendations. 281 00:35:14,760 --> 00:35:22,020 And the idea is, if you have to be an expert in a topic, it will give you a problem of the day. 282 00:35:22,980 --> 00:35:30,600 And that problem tries to patch misconceptions you might have had, or advance your skill. 283 00:35:31,050 --> 00:35:35,130 So imagine a 20-minute-a-day refresher. 284 00:35:35,370 --> 00:35:38,580 So you stay current as an expert in your field. 285 00:35:39,300 --> 00:35:43,680 And that helps people prevent skill decay. 286 00:35:44,430 --> 00:35:52,770 That's a big problem, especially if you have a class where they throw all this information at you for 40 hours, 287 00:35:53,400 --> 00:36:01,470 distributed over whatever period of time. And if you look at people six months later, a lot of it is forgotten. 288 00:36:02,190 --> 00:36:07,020 So the goal is to have some activity to keep up your skills. 
289 00:36:07,380 --> 00:36:18,900 And we actually prevented skill decay in an area of electronics, which is very important in the Department of Defence. 290 00:36:20,930 --> 00:36:25,570 This is one of the projects I led, and it's still being used. 291 00:36:25,580 --> 00:36:39,860 ElectronixTutor is actually being used in nuclear power command training for people who want to go into the Navy in that area. 292 00:36:40,460 --> 00:36:51,680 That's not nuclear bombs, by the way. It's nuclear power to energise the ships. And so lots of institutions were involved: 293 00:36:52,390 --> 00:36:56,150 the University of Southern California; Memphis was the lead. 294 00:36:56,480 --> 00:37:07,700 And then we had people from Arizona State University, WPI; we had industry involved, from Raytheon, and so on. 295 00:37:08,360 --> 00:37:15,170 This is the most co-authors I've ever had on an article, because I wanted everybody included. 296 00:37:15,920 --> 00:37:22,640 And you've probably heard physics papers have 500 authors. 297 00:37:22,850 --> 00:37:27,170 I felt I was drifting towards that issue. 298 00:37:28,640 --> 00:37:32,920 But imagine. 299 00:37:34,340 --> 00:37:40,720 Really, a federation of intelligent tutoring systems and learning environments. 300 00:37:41,560 --> 00:37:49,780 Imagine just reading texts like the Navy documents, or AutoTutor, conversational. 301 00:37:50,620 --> 00:38:03,820 And then you have LearnForm, doing reasoning with mathematics, or assessments that train people on things like Kirchhoff's Law and Ohm's Law, very mathematical. 302 00:38:04,450 --> 00:38:13,690 And then you have Dragoon, a deep sort of mental-models approach to electronics, very deep simulation sort of stuff. 303 00:38:13,990 --> 00:38:20,620 So it's a collection of intelligent tutoring systems, all integrated in one environment. 304 00:38:21,790 --> 00:38:25,840 And you track all the activities, stored in the cloud. 
305 00:38:26,260 --> 00:38:40,000 And you kind of see, hopefully, the long-term vision is to see what people learn and what environments they choose to use. 306 00:38:40,450 --> 00:38:43,870 Do they want a visual or a simulation? 307 00:38:44,170 --> 00:38:54,940 Of course, we already know people try to avoid hard work, and so they often don't choose things wisely if they do it in a self-regulated way. 308 00:38:57,080 --> 00:39:09,660 But you can have a recommender system to guide them, and that may help them stay on a good path. 309 00:39:11,940 --> 00:39:24,600 You might ask about the measures. Well, you collect a lot of measures, including affect; you can track emotions and engagement. 310 00:39:26,430 --> 00:39:32,310 Excuse me. The hot weather is getting to me. You can track how much initiative they take. 311 00:39:35,070 --> 00:39:38,310 You can see if they follow your recommendations. 312 00:39:38,610 --> 00:39:47,600 And all of that is stored in the cloud. I just want to make one point here. 313 00:39:49,230 --> 00:39:52,500 A lot of what we do involves natural language interpretation. 314 00:39:53,870 --> 00:39:58,640 And what we do is see how much experts agree with each other. 315 00:39:59,620 --> 00:40:04,060 And then we see how much the computer agrees with the experts. 316 00:40:04,990 --> 00:40:13,960 And then what we do is we see how much the computer's analysis of natural language can mimic an expert. 317 00:40:14,260 --> 00:40:24,150 And we find we're almost there. Whether you take a stringent criterion on interpreting the things or a lenient one, we're almost there. 318 00:40:24,160 --> 00:40:28,660 And this is in electronics. We've found it in other topics as well. 319 00:40:28,930 --> 00:40:38,410 So really, assessing natural language and how well it compares to expected information, 320 00:40:38,860 --> 00:40:42,310 really, the field has advanced to being pretty good. 
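The comparison described here, how much experts agree with each other versus how much the computer agrees with an expert, can be sketched with standard inter-rater statistics. The following is a hypothetical illustration only, not the project's actual code: the labels and data are invented, and percent agreement with Cohen's kappa stands in for whatever stringent or lenient criteria the team actually used.

```python
# Hypothetical sketch: is computer-expert agreement approaching expert-expert
# agreement? All data and function names here are invented for illustration.
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    po = percent_agreement(a, b)          # observed agreement
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    # expected agreement if each rater labelled at their marginal rates
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Toy labels: 1 = "answer matches the expected information", 0 = it doesn't.
expert1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
expert2 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
computer = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

human_human = cohens_kappa(expert1, expert2)
computer_human = cohens_kappa(expert1, computer)
# "Almost there" in the talk's sense would mean computer_human approaching
# human_human, the ceiling set by humans themselves.
print(human_human, computer_human)
```

The point of chance correction is that humans largely agreeing is the realistic ceiling: a computer cannot be expected to agree with an expert more consistently than two experts agree with each other.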
321 00:40:43,320 --> 00:40:53,250 The last thing is the National Science Foundation Learning Data Institute, and we've been working on this during the last three years. 322 00:40:55,050 --> 00:41:05,940 The National Science Foundation is in the middle of building these larger centres, usually for about $20 to $25 million for a five-year period. 323 00:41:06,150 --> 00:41:08,670 And they have to involve many institutions. 324 00:41:09,000 --> 00:41:20,340 And we've been involved in the beginning part of one of these, and we have a lot of groups that have to kind of 325 00:41:22,340 --> 00:41:32,420 collaborate and analyse data. As you know, data comes in many forms and formats, and it's distributed all over the place. 326 00:41:33,230 --> 00:41:40,760 You need good data science on how to cobble together all those separate databases, 327 00:41:41,270 --> 00:41:49,070 and then you need to figure out how to apply advanced quantitative methods to analyse the data, 328 00:41:50,990 --> 00:41:57,770 whether it's machine learning or educational data mining methods, AI, whatever they are, 329 00:41:58,130 --> 00:42:05,990 sophisticated psychometrics, whatever it is, and to get more people involved as a community doing this. 330 00:42:06,830 --> 00:42:16,760 So at this point, I'm involved with all these large groups of people, leading none of them. 331 00:42:17,480 --> 00:42:24,980 And at some point, I will be transitioning to a full retirement. 332 00:42:25,310 --> 00:42:31,670 And I just don't know when that will be. My wife does have her views on that. 333 00:42:32,240 --> 00:42:45,670 Well, thank you so much. The next speaker, who will give some reflections on this topic, is Enterprise Professor Sandra Milligan, 334 00:42:46,100 --> 00:42:51,730 who is Director of the Assessment Research Centre at the Melbourne Graduate School of Education, University of Melbourne. 
335 00:42:52,390 --> 00:42:57,790 And Sandra has an unusually wide engagement across education, industry and research. 336 00:42:58,390 --> 00:43:06,160 Originally a teacher of science and mathematics, she's also a former Director of Curriculum in an Australian state education department and has 337 00:43:06,160 --> 00:43:11,170 held senior research, management and governance positions in a range of educational organisations, 338 00:43:11,620 --> 00:43:19,090 including government agencies, not-for-profits, small start-up businesses and large listed international corporations. 339 00:43:19,930 --> 00:43:26,560 Sandra's current research interests focus on assessment, recognition and warranting of hard-to-assess learning. 340 00:43:27,220 --> 00:43:34,270 She directs several research partnerships with school networks and organisations working to develop learner profiles for their students. 341 00:43:35,080 --> 00:43:42,969 She is lead author of the Future-Proofing Australian Students with New Credentials report, outlining methods to reliably 342 00:43:42,970 --> 00:43:49,570 assess and recognise the level of attainment of general capabilities, and of Recognition of Learning Success 343 00:43:49,690 --> 00:43:58,120 for All: Ensuring Trust and Utility in a New Approach to Recognition of Learning in Senior Secondary Education in Australia. 344 00:43:58,770 --> 00:44:02,319 So Sandra, we're pleased to welcome you. 345 00:44:02,320 --> 00:44:10,690 The floor is yours. And look, first of all, can I just say it is a huge honour to be here. 346 00:44:11,050 --> 00:44:16,840 I just love being here in Oxford and at the Centre for Educational Assessment. 347 00:44:17,110 --> 00:44:22,629 It is a really well-known organisation globally, but also in Australia. 
348 00:44:22,630 --> 00:44:30,880 So my associations are with Therese and Josh, and before that with David Andrich, who is a great mentor of mine. 349 00:44:31,270 --> 00:44:41,919 Well, he may not claim to have mentored me, but I claim that he mentored me, so he might not want to take the blame. 350 00:44:41,920 --> 00:44:51,520 But I've long had this centre in my imagination, and to be here today celebrating, I'm very honoured. 351 00:44:51,520 --> 00:45:01,630 So thank you very much. Actually, I wanted to take off from the final question that Art finished with, 352 00:45:02,170 --> 00:45:18,520 and that is to do with the real importance of aligning what we value with what we teach, with what we want learnt, with what we assess. 353 00:45:19,300 --> 00:45:26,800 So I'm going to be talking about that today, and how we might do that in the context of the work that is 354 00:45:26,800 --> 00:45:36,910 assisting me, and possibly other Australians, in assessment and certification in senior secondary education. 355 00:45:37,390 --> 00:45:47,900 Now, I've been watching the papers on the COVID fallout on the A-levels. 356 00:45:47,920 --> 00:45:50,920 You know, do we need these exams? Should we have exams? 357 00:45:50,920 --> 00:45:54,760 Do they do the right thing? Is that what we should be doing? 358 00:45:55,000 --> 00:46:02,650 Those issues are being mirrored everywhere, at least in the Anglosphere, certainly in Australia. 359 00:46:03,790 --> 00:46:10,420 In the Australian context, people, including my colleagues at the University of Melbourne, 360 00:46:11,350 --> 00:46:21,030 which we say is the Oxford of Australia, or possibly the Harvard, depending on which country we're talking about. 361 00:46:22,150 --> 00:46:27,430 So my colleagues who run the medical schools, the engineering schools, 362 00:46:27,850 --> 00:46:34,210 what they're saying is the senior secondary examination and certification system is not doing its work. 
363 00:46:35,050 --> 00:46:41,650 It's not giving us the people who are going to make the great doctors, who will love being engineers. 364 00:46:41,980 --> 00:46:48,100 It's giving us the high scorers on examinations, and that's not who we need. 365 00:46:48,850 --> 00:46:54,819 Some of the people who are high scorers are not going to make great doctors, and 366 00:46:54,820 --> 00:47:00,910 some people who are low scorers could be brilliant engineers if given the chance, 367 00:47:01,030 --> 00:47:09,220 etc. In addition, in our senior secondary schools, we still can't get 100% of people finishing. 368 00:47:09,520 --> 00:47:12,910 Still, 15 to 20% of students don't finish. 369 00:47:13,360 --> 00:47:19,690 That's terrible. Of those who do finish, a fair few of them truant. 370 00:47:20,440 --> 00:47:22,750 Don't turn up, don't care. 371 00:47:23,440 --> 00:47:35,140 Others sit conscientiously in the class, sending themselves into stress spirals and asking themselves, is this the pinnacle 372 00:47:35,320 --> 00:47:38,410 of learning, this thing that I'm doing? 373 00:47:38,650 --> 00:47:42,430 Do you see what I mean? Now, I'm not saying Australian education is terrible. 374 00:47:43,180 --> 00:47:48,880 I think it's great. But what I am saying is that there's room for improvement. 375 00:47:49,390 --> 00:47:55,150 And the general consensus in Australia is that there is room for improvement. 376 00:47:55,480 --> 00:47:59,020 We've had a number of reports over the last ten years or so. 377 00:47:59,020 --> 00:48:07,180 Those from Australia will recognise them: the Gonski reports, the Shergold reports, the review of the AQF. 378 00:48:07,810 --> 00:48:11,590 They're all saying we need to do something different, guys. 379 00:48:12,730 --> 00:48:18,580 And one of the reasons we don't is because of the assessment system. 
380 00:48:19,210 --> 00:48:34,210 So I agree with Art that assessment can be an absolute killer of really good education, but it can also be the thing that can shift things 381 00:48:34,420 --> 00:48:40,989 so it's better. So I want to tell you a little bit about the work that we're doing in senior 382 00:48:40,990 --> 00:48:47,469 secondary education to shift to what we value, so that we get full engagement, 383 00:48:47,470 --> 00:48:52,780 enthusiastic engagement, of all students in senior secondary. 384 00:48:53,200 --> 00:49:08,320 That's the goal. Let me just say something about the basis on which I'm going to make some comments, and that is, as Therese said, I'm an Enterprise Professor. 385 00:49:08,500 --> 00:49:13,900 I don't know whether you know what that means. I head an Enterprise Unit. 386 00:49:14,260 --> 00:49:17,590 What it actually means is that I have to get my own money. 387 00:49:18,280 --> 00:49:22,780 The university doesn't support us in that regard. 388 00:49:22,810 --> 00:49:27,910 In fact, we pay a tax, really, to be part of the university. 389 00:49:28,180 --> 00:49:33,370 I think that's possibly the way university research centres are going. 390 00:49:34,030 --> 00:49:40,240 And I am measured not only on revenue but on impact. 391 00:49:41,020 --> 00:49:46,870 So I have to show that the work in our centre is actually making a difference. 392 00:49:47,410 --> 00:49:55,510 The way we do that is that we problem-solve for industry folk, most particularly in the area of assessment. 393 00:49:55,930 --> 00:50:02,740 We're very fortunate, because we get people who are passionate about education, 394 00:50:03,370 --> 00:50:09,190 who hate the assessment system, who believe that there's a better system. 395 00:50:09,520 --> 00:50:14,710 They come to us and say, will you help us reform assessment so that we can meet our goals? 396 00:50:15,220 --> 00:50:17,980 They are what I call first movers. 
397 00:50:18,430 --> 00:50:28,780 So we are totally privileged to work with dedicated people who want to shift the system, and it's our job to do the R&D for them. 398 00:50:29,320 --> 00:50:33,730 What a pleasure. So, okay, what R&D are we doing? 399 00:50:33,760 --> 00:50:35,830 I'm going to go pretty quickly through this. 400 00:50:36,250 --> 00:50:44,680 And I'm assuming that these slides can be distributed, and if there's anything else you want to follow up on, I'm happy to chat. 401 00:50:45,280 --> 00:50:57,460 But this is it. This is what our first movers in senior secondary have been telling us over the last decade is the objective. 402 00:50:58,180 --> 00:51:04,780 They want to get beyond just knowledge. 403 00:51:05,740 --> 00:51:10,720 They want to get to the point where students have agency in what they're doing. 404 00:51:10,730 --> 00:51:13,270 In other words, they learn what they think is important. 405 00:51:13,900 --> 00:51:23,650 They want to get beyond knowledge to competence, so that learners not only know, but know how to do things. 406 00:51:23,890 --> 00:51:29,710 This is a big shift. This is a big shift. 407 00:51:30,010 --> 00:51:34,330 So that's what they want. That little sort of, what is it, 408 00:51:34,660 --> 00:51:46,210 pentagon of hexagons? They're the categories of learning ambition that we've abstracted from our first movers. 409 00:51:46,420 --> 00:51:50,830 And you can see, I don't know if, no, I haven't got a pointer. 410 00:51:51,070 --> 00:51:56,230 You can see that two of those on the left-hand side are fairly normal: knowledge and know-how. 411 00:51:56,240 --> 00:52:02,810 That's the curriculum as we know it. And basic literacies, that's the curriculum as we know it. 412 00:52:03,160 --> 00:52:05,860 But they're wanting to add those other three bits. 
413 00:52:06,400 --> 00:52:18,790 They want to teach learners to manage their own learning, not just be drilled to death and coached to the nth degree to pass the exams. 414 00:52:19,570 --> 00:52:30,880 They want to teach students how to connect to sustaining communities that they will need to be part of to thrive. 415 00:52:30,940 --> 00:52:34,720 Thrive is the key word. So that's communities 416 00:52:34,830 --> 00:52:40,860 of scholarship, communities of work, communities of culture, communities of community, 417 00:52:41,280 --> 00:52:48,000 so they want to teach students how to embed themselves in those sustaining communities. 418 00:52:48,240 --> 00:52:50,790 And they want to give them the learning staples, 419 00:52:50,790 --> 00:53:04,290 so 21st-century skills or general capabilities, because that's the skill set that enables students to build depth and breadth in their learning. 420 00:53:04,440 --> 00:53:09,660 That's what they want to do. And they come to us and say, well, that's what we want to do. 421 00:53:09,990 --> 00:53:13,650 Can you fix the assessment system so that we can do it? 422 00:53:14,190 --> 00:53:28,980 All right. The first step, of course, is to actually define these learning ambitions in ways that make them teachable, learnable and assessable. 423 00:53:29,520 --> 00:53:38,610 It's very easy for people to drift off into things that aren't really curriculum objectives. 424 00:53:38,760 --> 00:53:47,520 For instance, I mean, I'll probably make enemies by saying this, but if you take the issue of student well-being, 425 00:53:48,660 --> 00:53:59,130 I mean, all sorts of things can cause a student to have a very poor sense of self-esteem or well-being at any given time. 426 00:53:59,640 --> 00:54:03,750 It's unlikely that a school can guarantee the 427 00:54:05,980 --> 00:54:08,420 well-being of every student. 
428 00:54:08,440 --> 00:54:20,110 But they can guarantee that they are teaching the skills that students will need, that will enable them to thrive. 429 00:54:20,440 --> 00:54:34,900 So we spend a lot of time trying to focus people on defining what it is they're actually wanting to teach, that students can learn, and that is assessable. 430 00:54:35,200 --> 00:54:45,350 So it's important that this is curriculum-based, not a generalised-happiness sort of thing. 431 00:54:45,390 --> 00:54:48,130 Do you see what I mean? I'm not expressing that very well. 432 00:54:50,260 --> 00:54:57,070 For instance, when it comes to agency in learning, I won't go through this, but we've worked through: 433 00:54:57,430 --> 00:55:09,640 what is that? What are the skill features that you need if you're going to actually establish a learning environment that will give students agency? 434 00:55:09,670 --> 00:55:21,190 So we've done a detailed analysis of many of these skills, so that we understand what they are as objects of teaching and learning. 435 00:55:22,540 --> 00:55:31,149 I've just put this up here quickly. I probably will deny having shown you this, because this is a framework that the Federal, 436 00:55:31,150 --> 00:55:41,130 our Commonwealth Government, is currently consulting on to be the basic framework for post-school, 437 00:55:41,980 --> 00:55:52,130 post-compulsory senior secondary education, the national framework that will capture what those curriculum elements are. 438 00:55:52,180 --> 00:55:57,490 I mean, it's a very interesting document. It's not policy yet, 439 00:55:57,970 --> 00:56:03,430 but I'm hoping this, or something like it, will soon be policy in Australia, and it 440 00:56:03,430 --> 00:56:08,200 says these are the skills that you need, as well as the knowledge and the know-how, 441 00:56:09,400 --> 00:56:19,660 if you're a school leaver. At that point, most teachers freak out 
442 00:56:21,350 --> 00:56:28,940 and they say, oh, well, it's all very well to have a nice framework, but how do you assess those things? 443 00:56:30,020 --> 00:56:39,710 And so we've put a lot of effort into getting teachers to the point where they feel confident and capable in assessing these things. 444 00:56:39,920 --> 00:56:45,080 And you know, this is my favourite sort of picture, from Dreyfus. 445 00:56:45,410 --> 00:56:49,460 You might remember the Dreyfus taxonomy of competence. 446 00:56:49,880 --> 00:56:56,990 And so what we've discovered is that for each of these general capabilities or 21st-century skills, 447 00:56:57,350 --> 00:57:08,060 you can define a progression with five or six big leaps in the constellation of skills that people have. 448 00:57:08,390 --> 00:57:16,490 And so the challenge for teachers is just to understand those big qualitative leaps 449 00:57:16,970 --> 00:57:29,770 in competence that students go through, and then learn what to look for to place them at those positions. In our final ranking in Australia, 450 00:57:30,050 --> 00:57:34,010 we've got a thousand-point scale. 451 00:57:34,100 --> 00:57:40,970 It's called the ATAR, on which every senior secondary graduate is placed. 452 00:57:41,000 --> 00:57:44,090 If you don't believe me, ask the other Australians in the room. 453 00:57:46,310 --> 00:57:54,050 And so when we say, hmm, we think that there are five levels, not a thousand levels, 454 00:57:54,350 --> 00:57:58,700 people think we're being radical. Okay, it's sort of interesting. 455 00:58:00,020 --> 00:58:07,010 This is an example of one of those progressions that we've used. 456 00:58:07,250 --> 00:58:11,270 This was one we used in the Philippines. 457 00:58:11,690 --> 00:58:18,440 It's one on collaboration. It's sort of like collaborative problem solving. 458 00:58:18,920 --> 00:58:24,860 But the key thing for a teacher is that they recognise those levels. 
459 00:58:24,890 --> 00:58:31,550 They recognise that at the lowest level, a student needs to be told what to do with others. 460 00:58:32,750 --> 00:58:44,120 And then in the middle, they get quite used to the idea that they're working with a group, and now fit in and go with the flow. At the top level, 461 00:58:44,510 --> 00:58:52,640 they're the people who organise the environment, and engage and motivate others to collaborate. 462 00:58:53,030 --> 00:59:01,490 And once you explain that, and explain the behavioural indicators that go with each of those, 463 00:59:01,760 --> 00:59:06,320 we find that teachers fairly soon relax and say, I can do that. 464 00:59:07,040 --> 00:59:10,220 And this sort of fear factor disappears. 465 00:59:12,890 --> 00:59:16,970 To assess it, though, you cannot use examinations. 466 00:59:17,000 --> 00:59:21,860 These are not cognitive skills. You cannot use multiple-choice questions. 467 00:59:22,400 --> 00:59:34,940 You can't really use stuff that's written down. It needs to be performance-based, with multiple raters, multiple performances, 468 00:59:35,420 --> 00:59:49,850 so that there can be a human-based judgement of the level of attainment from the performance of a student. 469 00:59:50,390 --> 00:59:54,700 So getting that organised in a school is a non-trivial task. 470 00:59:54,850 --> 01:00:01,670 I'm forever happy to talk about this ad nauseam, but I'm going to go pretty quickly here. 471 01:00:02,000 --> 01:00:03,500 Now, this is the coup de grâce. 472 01:00:04,490 --> 01:00:19,220 This is what a senior secondary certificate currently is, and this exists in Australia to capture the attainments of a student. 473 01:00:19,940 --> 01:00:24,020 What you will notice about it, first of all, is that it's a digital document. 474 01:00:25,730 --> 01:00:32,570 It's got links all through it. So you can see that there are links to statements by Abbie. 
475 01:00:33,080 --> 01:00:40,070 You can see that it's got links to her portfolio, where you can see all the authentic performances that she's performed. 476 01:00:40,820 --> 01:00:46,850 The chrysanthemum in the middle. We call it a chrysanthemum because no one knows how to spell that. 477 01:00:46,850 --> 01:00:53,300 So it makes sense. Right? But it's a rose graph, for those who are technically minded. 478 01:00:53,690 --> 01:00:59,960 We call it a chrysanthemum, and that is a standards-based, 479 01:01:01,490 --> 01:01:15,049 highly psychometrically reliable, judgement-based assessment of the level of attainment of Abbie in the five areas that she was assessed on, 480 01:01:15,050 --> 01:01:21,020 which were quantitative reasoning, knowing how to learn, communication and whatever. 481 01:01:21,680 --> 01:01:31,009 And this is currently being used in 20 schools in Australia and a number in the US, and has now been accepted 482 01:01:31,010 --> 01:01:40,760 by 50% of our Australian universities as a legitimate alternative to the senior secondary assessments. 483 01:01:42,170 --> 01:01:45,530 The kids love it. Everyone loves the chrysanthemum. 484 01:01:46,010 --> 01:01:57,410 But most importantly, you can use the metrics underpinning the chrysanthemum to do the sorts of things 485 01:01:57,440 --> 01:02:06,080 that would enable matching of student attainment to opportunities that exist. Like, you might get 99 in the ATAR, 486 01:02:06,470 --> 01:02:11,720 but if you've got no empathy, there's no point in going into medicine, etc. 487 01:02:11,730 --> 01:02:17,140 No, that's probably not true, but never mind. 488 01:02:22,820 --> 01:02:32,660 So we're currently working with one of the biggest university selector groups, UAC, the Universities Admissions Centre, 489 01:02:33,500 --> 01:02:42,830 working out what sorts of uses this could have in university selection. 
490 01:02:42,950 --> 01:02:46,580 This is really interesting stuff, really interesting stuff. 491 01:02:47,750 --> 01:02:50,440 And it's not just this big-picture one. 492 01:02:50,450 --> 01:03:03,530 We're involved in a range of new credentialing projects where first movers in Australia are being given support by the University of Melbourne. 493 01:03:03,680 --> 01:03:13,190 The University of Melbourne is actually underwriting the quality of the credentials, so that we can move this along a bit further. 494 01:03:13,760 --> 01:03:17,900 I'm going to finish now, but I would just say this. 495 01:03:19,250 --> 01:03:27,620 From the point of view of being the Director of the Assessment Research Centre, and all those who travel in it, in the centre, 496 01:03:28,070 --> 01:03:40,570 we see this sort of activity as fundamental. I call it D and R, a development-and-research activity. 497 01:03:40,580 --> 01:03:51,740 It's not traditional research. It's actually trying to solve a problem, coming up with a solution, working out the science that's underneath it, 498 01:03:52,250 --> 01:03:57,920 and then packing the research evidence around it and altering it where it doesn't stack up. 499 01:03:58,490 --> 01:04:05,899 These are the kinds of empirical questions that we're working on with our first 500 01:04:05,900 --> 01:04:12,980 movers, and with the very interested observation of the policy people in Australia. 501 01:04:13,520 --> 01:04:28,010 And I'm hopeful that fairly soon we'll see learner profiles, based on things that we value that will help students thrive, 502 01:04:28,640 --> 01:04:37,460 become the lingua franca of the senior secondary certification system, to help us match people to opportunities. 503 01:04:37,760 --> 01:04:40,820 And it will be brilliant. So there you go. Thank you very much. 504 01:04:46,180 --> 01:04:49,450 Thank you so much. That was extremely fascinating. 
505 01:04:49,450 --> 01:04:59,440 So now I think we need to listen to a little bit about evaluating a method for predicting the 506 01:04:59,500 --> 01:05:04,540 comparability of International Baccalaureate cross-lingual assessments by combining psychometrics, 507 01:05:04,870 --> 01:05:09,189 machine learning and natural language processing, from Josh McGrane, 508 01:05:09,190 --> 01:05:15,790 our Deputy Director of the Oxford University Centre for Educational Assessment, a Senior Research Fellow and a Fellow of Kellogg College. 509 01:05:16,390 --> 01:05:22,540 He completed his university-medal-winning PhD in quantitative psychology at the University of Sydney. 510 01:05:23,500 --> 01:05:29,520 He has also been a postdoctoral fellow at the University of Western Australia and worked as a psychometrician 511 01:05:29,740 --> 01:05:36,970 for the Centre for Education Statistics and Evaluation in the New South Wales Department of Education. 512 01:05:38,200 --> 01:05:43,420 As a result, he has extensive experience in academic and government contexts in educational assessment, 513 01:05:43,810 --> 01:05:49,720 including being an expert adviser for several national and international government and non-profit organisations. 514 01:05:50,410 --> 01:05:56,530 His research and teaching interests span the philosophical, political and statistical aspects of psychometrics. 515 01:05:57,160 --> 01:06:06,190 He's presently working on several projects, including the development of a better framework for measurement across the physical and social sciences, 516 01:06:06,580 --> 01:06:12,340 combining psychometrics and AI modelling to predict and explain bias in educational assessments, 517 01:06:12,760 --> 01:06:20,860 educating critical thinking amongst secondary students internationally, and the development of language assessments for understudied languages. 518 01:06:21,400 --> 01:06:28,780 So, Josh, the floor is yours. Thank you, Therese. 
519 01:06:28,780 --> 01:06:35,019 So I'm now very much convinced that the weather gods of Great Britain hate me, and 520 01:06:35,020 --> 01:06:39,459 they are vengeful, because I've now lived here for going on six years and, most days, 521 01:06:39,460 --> 01:06:43,870 just like the people I've worked with, have complained about the weather. And what they've done is 522 01:06:44,140 --> 01:06:48,910 they've turned around, on the one day when it's supposed to be our last hurrah and celebration, and 523 01:06:48,910 --> 01:06:53,350 given me a day of Australian weather, so that half of the British people didn't show up today. 524 01:06:53,350 --> 01:06:57,100 So thank you for your vengeful irony in that regard. 525 01:06:57,640 --> 01:07:01,330 Now, I just want to say thank you to the two speakers that have come before me. 526 01:07:01,330 --> 01:07:08,229 They've both been wonderful talks and, funnily enough, there's a lot of overlap with what I'm about to say, in sort of nuanced ways. 527 01:07:08,230 --> 01:07:13,209 But I will say that we're going from sort of big-thinking, big-picture ideas to a very, 528 01:07:13,210 --> 01:07:18,790 very small-scale research study that really is only getting underway now. 529 01:07:19,000 --> 01:07:26,020 In fact, I'd say we're about two metres into the 100-metre sprint, and we probably stumbled a little bit out of the blocks. 530 01:07:26,020 --> 01:07:31,860 Okay, so spoiler alert, the findings aren't that amazing, but I wanted to talk about something more. 531 01:07:31,870 --> 01:07:35,049 We're talking about the future of assessment, the future of research. 532 01:07:35,050 --> 01:07:42,400 Well, often in these kinds of talks, you get people basically reflecting on things they've been doing for the last ten years. 533 01:07:42,400 --> 01:07:46,270 That makes it seem all simple and neat and linear, and it all goes together. 
534 01:07:46,540 --> 01:07:49,749 Well, I can assure you, when you're on the ground actually taking risks, 535 01:07:49,750 --> 01:07:55,329 having no idea whether these small but big ideas are going to work out, that it's not that neat. 536 01:07:55,330 --> 01:07:59,440 And I think that's an important lesson for the students in the room to learn in particular. 537 01:07:59,890 --> 01:08:04,870 And on that note, I just want to thank the people who have made the effort to come today despite the heat. 538 01:08:04,870 --> 01:08:08,320 So, Jonathan Michie, the president of Kellogg College, thank you for coming along. 539 01:08:09,160 --> 01:08:14,290 I see several of my colleagues in the room as well, so thank you for making the effort too: members of our centre, 540 01:08:14,290 --> 01:08:21,040 both present and former, and friends and colleagues and what have you. 541 01:08:21,040 --> 01:08:26,650 So thank you to everyone for coming along and listening to me talk about the rather nerdy stuff that you're about to hear. 542 01:08:27,040 --> 01:08:32,120 So. Just to give you some context: 543 01:08:32,140 --> 01:08:38,950 we were contracted by the International Baccalaureate Organisation to do a study looking at whether their translation 544 01:08:38,950 --> 01:08:45,850 processes were working well in their Diploma Programme examinations, and specifically within their science programme. 545 01:08:45,890 --> 01:08:53,290 Interestingly, the Diploma Programme qualification, I see, has a lot of overlap with what you were showing 546 01:08:53,290 --> 01:08:58,390 that the Australian Federal Department is considering putting in as actual policy now. 547 01:08:58,420 --> 01:09:02,110 So I hope that those two bodies are in a lot of conversation with one another. 
548 01:09:03,520 --> 01:09:08,680 But one unique aspect of the International Baccalaureate that isn't present in 549 01:09:09,040 --> 01:09:13,959 many national systems is the fact that you have to make your assessments valid, 550 01:09:13,960 --> 01:09:18,690 comparable and unbiased across several languages, in this case many languages, 551 01:09:18,700 --> 01:09:22,810 but in this particular research we would be looking at three of them. 552 01:09:23,320 --> 01:09:26,950 And so normally, when asking this sort of question, 553 01:09:27,850 --> 01:09:33,879 are examinations comparable across languages, we'll be doing the typical things we do within assessment: first and 554 01:09:33,880 --> 01:09:45,430 foremost, we'll be having a look at the translation processes and evaluating whether they meet best practice and what have you. 555 01:09:46,410 --> 01:09:54,120 Also, we might have actual human experts look at the different exam scripts and make judgements about whether, 556 01:09:54,120 --> 01:09:59,090 based on nuanced differences in the language that they see, 557 01:09:59,100 --> 01:10:04,710 you might predict there to be differences in the performance of the items across those languages. 558 01:10:04,980 --> 01:10:09,320 Now, as part of this broader research project, we did look at both of those questions. 559 01:10:09,330 --> 01:10:11,399 Yasmine, sitting in the crowd here, in particular 560 01:10:11,400 --> 01:10:17,250 did, but I'm not really going to be touching on that at all, just given the amount of time that I have available to me. 561 01:10:17,640 --> 01:10:22,530 And so what I am going to be focusing on is the part of the research study that I was driving, 562 01:10:23,190 --> 01:10:28,379 which was looking at: can you actually use AI, essentially, to predict differences 563 01:10:28,380 --> 01:10:33,180 in the performance of items across multiple languages for high-stakes assessment? 
564 01:10:34,930 --> 01:10:40,690 Now that's an awful mouthful to get through as a title, so I thought I'd better come up with another one. 565 01:10:40,930 --> 01:10:45,340 And for those of you familiar with the work of Stanley Kubrick and Dr. Strangelove, 566 01:10:45,730 --> 01:10:50,140 an alternative title would be: how I learnt to stop worrying and love statistics (and AI). 567 01:10:50,320 --> 01:10:57,520 And now, of course, I have my tongue firmly in my cheek here, because we have a quote from Borsboom, 568 01:10:57,520 --> 01:11:05,259 from the University of Amsterdam, and one of his colleagues, where he refers to the psychological test, or in our context 569 01:11:05,260 --> 01:11:09,880 we might think of it more broadly as the educational assessment, as being the atomic bomb. 570 01:11:09,940 --> 01:11:13,780 Now, interestingly, we had atomic bombs come up in one of the earlier speeches. 571 01:11:14,110 --> 01:11:17,380 Now, clearly, he's using a degree of hyperbole there. 572 01:11:17,590 --> 01:11:23,319 But what he's really getting at is that this is the one thing that assessment, both 573 01:11:23,320 --> 01:11:30,340 psychological and educational, has given the world that has actually radically changed global power structures. 574 01:11:30,700 --> 01:11:34,510 And so, moving into the future of assessment, me as a psychometrician, 575 01:11:34,900 --> 01:11:44,260 I know how much power I am often asked to imbue in assessments via those magical terms: is it valid and reliable? 576 01:11:44,480 --> 01:11:52,660 Okay, I put it through my statistical models, spit out the answers at the other end and say, yes, it's kind of valid, it's kind of reliable. 577 01:11:52,900 --> 01:11:57,100 And then you say it's valid and reliable, and go forth and change people's lives with it. 
578 01:11:57,130 --> 01:12:04,050 So for me, I just want to make clear that although I'm mashing up a lot of contentious areas here, 579 01:12:04,060 --> 01:12:07,390 psychometrics in itself can be quite contentious and reductive, 580 01:12:07,570 --> 01:12:13,479 along with machine learning and all the perils of AI that I'm sure most of us in the room are aware of, 581 01:12:13,480 --> 01:12:19,510 I hope that it is done with consideration of the actual ethical implications involved. 582 01:12:19,750 --> 01:12:25,540 I hope that this is a fairly benign application, but I'm happy to hear feedback on that. 583 01:12:26,740 --> 01:12:31,510 But I would just take a quick segue from the actual topic, because I think it is important, 584 01:12:31,510 --> 01:12:39,159 given that this is the last annual lecture that will be hosted by Therese and me, that I do acknowledge the host herself. 585 01:12:39,160 --> 01:12:44,860 And first and foremost, I should have Sam's photo up here as well, because they both organised the event today. 586 01:12:44,860 --> 01:12:46,450 So thank you on that behalf. 587 01:12:47,140 --> 01:12:53,550 But Therese is someone who spends a lot of time acknowledging others and very much diverts acknowledgement away from herself. 588 01:12:53,560 --> 01:12:57,400 So I am probably embarrassing the crap out of her right now, but that's okay. 589 01:12:57,430 --> 01:13:03,520 It's something that I want to do. Therese has spoken about what 590 01:13:03,520 --> 01:13:07,870 we've achieved together and collectively as a centre over the last five years. 591 01:13:07,870 --> 01:13:12,910 And I think it's a testament to both us as a team, but particularly her as a leader, 592 01:13:13,270 --> 01:13:18,930 that a lot of that has happened in the last two and a half years, and the last two and a half years have been very difficult. 593 01:13:18,970 --> 01:13:22,629 I'm not sure if anyone has noticed, you know, on several, several levels. 
594 01:13:22,630 --> 01:13:30,190 And so I just wanted to say to you, Therese: you brought up the topics of compassion and integrity, 595 01:13:30,460 --> 01:13:34,930 and I could not think of a leader who embodies those two principles more than you do. 596 01:13:34,930 --> 01:13:39,159 So it's my pleasure to have you as a leader. My pleasure to have you as a colleague. 597 01:13:39,160 --> 01:13:41,170 And my pleasure to have you as a friend. So thank you. 598 01:13:48,900 --> 01:13:56,940 Also, just as it takes a whole village to raise a child, it takes a whole research team to do a set of involved research projects like this. 599 01:13:57,330 --> 01:14:01,830 And so I just wanted to give a shout-out here to Yasmine El-Masri, who's in the crowd today. 600 01:14:01,830 --> 01:14:08,070 As I said, she actually conducted a large part of this broader research study that I'm not going to be talking about today. 601 01:14:08,700 --> 01:14:16,970 And also, you know, it's her work and her connection to Art Graesser and the work of Coh-Metrix that was part of the kernel of the idea in the first place. 602 01:14:16,980 --> 01:14:23,940 So I just wanted to acknowledge that too. Kit Double, another former post-doc of ours, who's fortunate enough to already be back in Australia, 603 01:14:26,250 --> 01:14:32,850 who, you know, did a lot of the dogsbody work of having to code and clean and what have you, which is the unsexy part of research, 604 01:14:32,850 --> 01:14:38,130 but it's extremely important, and people don't really acknowledge just how important it is often enough. 605 01:14:39,300 --> 01:14:48,330 Heather Kayton, who basically has been my right-hand person, I guess I should say, throughout this particular aspect of the project in particular. 
606 01:14:48,330 --> 01:14:54,180 And I'm really thankful for the support and also the expertise, given her background in applied linguistics, 607 01:14:54,180 --> 01:15:02,370 helping me to really understand what all this natural language processing blah blah actually means in terms of real language stuff. 608 01:15:03,300 --> 01:15:09,540 So she has really, really helped me along the way, and I hope that you'll continue with this. 609 01:15:10,530 --> 01:15:16,950 And then finally, Rebecca Hamer, who was the project coordinator from the IB side of things. 610 01:15:17,400 --> 01:15:18,750 She's based in The Hague. 611 01:15:18,750 --> 01:15:26,160 And I must admit, she did bust my balls a lot throughout this project, but sometimes that is required from a project manager. 612 01:15:26,310 --> 01:15:36,380 But she also helped an awful lot too, in terms of providing data, processing it and providing support by way of expert judges from the IB as well. 613 01:15:36,390 --> 01:15:41,040 So just a shout-out to her, because I do appreciate the input that she had to the project. 614 01:15:41,460 --> 01:15:48,840 Now, really returning to the actual substance of the talk, and just to create a little bit of synchronicity with Sandra: 615 01:15:49,410 --> 01:15:57,180 one last acknowledgement is a mentor of mine, and I think he would actually acknowledge that I was mentored by him, or at least I hope so. 616 01:15:57,180 --> 01:16:00,210 He was a reference for my job interview, so I guess he said nice things. 617 01:16:02,010 --> 01:16:11,459 Where this line of research is coming from, for me, is David Andrich's view of Rasch's measurement theory, and that's the 618 01:16:11,460 --> 01:16:18,210 idea that we have these psychometric models that we apply because they have good and desirable measurement principles. 
619 01:16:18,480 --> 01:16:22,290 And when we find problems, what he often refers to as anomalies, 620 01:16:22,410 --> 01:16:29,190 it's not an indication that we should go and get a different, fancier, better, more uninterpretable model. 621 01:16:29,400 --> 01:16:35,190 But rather, what we should do is take that as a lesson that perhaps there's something wrong with the assessment itself, 622 01:16:35,490 --> 01:16:39,150 whether that be in the administration or the substance of it. 623 01:16:39,360 --> 01:16:41,880 And so this is where I'm coming from here. Okay. 624 01:16:42,210 --> 01:16:49,170 We've had a million and one studies in the world about multilingual assessments, 625 01:16:49,170 --> 01:16:55,140 sorry, Heather, I'll make you cringe in saying multilingual, cross-lingual assessments, showing differential item functioning. 626 01:16:55,860 --> 01:17:04,010 But why? We need to predict and explain, so that we can actually take these signs of anomalies and improve the assessments going forward. 627 01:17:06,040 --> 01:17:09,040 So that's the general aim of this particular study. 628 01:17:09,070 --> 01:17:10,430 I'm going to have to hurry up a little bit. 629 01:17:10,450 --> 01:17:20,019 But essentially what we needed to do is basically test out, as a first stepping stone, an innovative approach, and I'd say, you know, 630 01:17:20,020 --> 01:17:24,190 arguably one of the best aspects of it is that it's scalable: 631 01:17:24,190 --> 01:17:28,690 it doesn't rely on you having to go back to humans and then making sense of things and whatnot. 632 01:17:28,690 --> 01:17:32,979 So if it works, it's a scalable approach to predicting language effects, 633 01:17:32,980 --> 01:17:40,600 i.e. these anomalies in item functioning, across different language versions of the high-stakes IB Diploma Programme examinations. 
634 01:17:40,600 --> 01:17:46,840 For those of you who don't know, the Diploma Programme is actually the end-of-secondary-schooling qualification for the IB. 635 01:17:46,870 --> 01:17:57,050 So it's the thing that determines whether you go to university or on to other relevant post-secondary pursuits. And, you know, 636 01:17:57,100 --> 01:18:03,909 we have been very fortunate to have the IB as a research funder and partner, because they have been open to 637 01:18:03,910 --> 01:18:10,899 us doing more of this sort of blue-sky research in the context of also doing practical things for them, 638 01:18:10,900 --> 01:18:14,320 like reviewing translation processes and what have you. 639 01:18:15,730 --> 01:18:21,940 And so what I proposed to do is to basically mash up psychometrics, 640 01:18:21,940 --> 01:18:26,589 machine learning and natural language processing to understand the extent to which the 641 01:18:26,590 --> 01:18:32,950 differences in item performance are related to differences in the textual complexity 642 01:18:32,950 --> 01:18:38,079 of the items across languages, and to determine which particular aspects of textual 643 01:18:38,080 --> 01:18:42,890 complexity are the most predictive of any cross-language differences that we find. 644 01:18:42,910 --> 01:18:47,110 So in order to understand the rest of the talk, you're just going to have to understand psychometrics, 645 01:18:47,110 --> 01:18:51,489 machine learning and natural language processing. Okay. No, just joking. 646 01:18:51,490 --> 01:18:57,610 Just follow the gist of it. In terms of what data we were dealing with: 647 01:18:57,610 --> 01:19:00,639 pretty complex stuff when you start breaking it all down. 648 01:19:00,640 --> 01:19:08,620 I must admit we went into the contract not quite realising just how complex what they were asking was, but there's another lesson for the future in that. 649 01:19:09,100 --> 01:19:12,159 So we're dealing with 90,000 students. 
650 01:19:12,160 --> 01:19:14,709 Most of them responded to the exam in English, 651 01:19:14,710 --> 01:19:22,380 but we have a fair proportion who responded to it in Spanish and a much smaller number who responded to it in French. 652 01:19:22,390 --> 01:19:27,280 What are we talking about here? We're talking about 12 exam papers: Biology, Chemistry and Physics. 653 01:19:27,550 --> 01:19:34,870 So three subjects across two calendar years, with each subject offered at both higher level and standard level. 654 01:19:35,110 --> 01:19:41,829 And they actually have three papers within each, but we just looked at the first two because Paper 3 involves optionality, 655 01:19:41,830 --> 01:19:48,820 which just creates even more complexity to deal with, Paper 1 being a multiple-choice exam and Paper 2 being constructed response. 656 01:19:49,150 --> 01:19:54,040 So all in all, we have 495 items in total across these different factors. 657 01:19:55,600 --> 01:19:59,139 So in order to build a predictive model, what do we have to do? 658 01:19:59,140 --> 01:20:05,110 We use the machine learning approach to identify optimal sets of features that predict the outcome variable, 659 01:20:05,110 --> 01:20:10,390 i.e. whether there was differential item functioning or not. I realise I haven't defined that yet, but I will in a minute. 660 01:20:10,900 --> 01:20:15,400 So, this is done using the ‘caret’ package in R. Now, this is just an incredible package. 661 01:20:15,400 --> 01:20:21,910 It still blows my mind with open-source software how people create this stuff and distribute it free of charge. 
662 01:20:22,270 --> 01:20:28,059 I recommend any student out there, even if you're just doing a basic linear regression or logistic regression, to look 663 01:20:28,060 --> 01:20:32,799 at this package, because it allows you to apply what's referred to as cross-validation, 664 01:20:32,800 --> 01:20:36,550 which I will get to in a second, which is so rarely done in our field. 665 01:20:36,580 --> 01:20:41,170 And, you know, the data scientist in the crowd is nodding his head. 666 01:20:41,170 --> 01:20:45,820 It's very necessary for us to start implementing that as standard practice in our field. 667 01:20:46,840 --> 01:20:51,100 And so to do this, it's first necessary to establish the outcome variable, 668 01:20:51,100 --> 01:20:56,079 which is the DIF estimates from the TAM package in R, as well as the predictor variables. 669 01:20:56,080 --> 01:21:01,210 And where we got those from was open-source software called READERBENCH. 670 01:21:02,380 --> 01:21:08,200 And also we included the test and item characteristics in there as predictor variables as well. 671 01:21:09,850 --> 01:21:18,220 So, detecting the DIF: basically, differential item functioning is when an item displays differences in difficulty 672 01:21:19,730 --> 01:21:24,470 across different groups that we would estimate as having the same underlying ability, 673 01:21:25,340 --> 01:21:30,440 which means that they have a different probability of getting a correct response for that item. 674 01:21:30,740 --> 01:21:35,270 So between groups, between, say, English responders and French responders, 675 01:21:35,270 --> 01:21:41,030 it may well be that there's, on average, a different level of attainment on a particular item. 676 01:21:41,660 --> 01:21:44,680 So DIF is different to that; it's not average performance. 
677 01:21:44,690 --> 01:21:51,680 It's: if we take, say, the French and English speakers who we believe have the same underlying ability level, 678 01:21:51,860 --> 01:21:55,100 is the item differentially difficult for them? 679 01:21:57,300 --> 01:22:04,140 And the particular model that we applied here is the random coefficients multinomial logit Rasch model. 680 01:22:04,830 --> 01:22:10,920 Now you don't need to understand what that is, other than to know that this is a parameterization of the Rasch model 681 01:22:11,150 --> 01:22:19,740 that's much more like a multilevel model. The reason why that matters is because DIF can be understood, in this parameterization of 682 01:22:19,740 --> 01:22:23,430 the model, as being an interaction between the item difficulty parameter 683 01:22:23,790 --> 01:22:27,540 and a group parameter that you can add into the model. 684 01:22:27,780 --> 01:22:34,290 In addition to that, you can also add in covariates. And this is important because so often when people evaluate DIF, 685 01:22:34,290 --> 01:22:39,090 it's just this group factor as one analysis followed by that group factor as another analysis. 686 01:22:39,420 --> 01:22:43,979 And the fact is that the language of the test interacts with all sorts of other different group factors. 687 01:22:43,980 --> 01:22:51,100 So, for example: are they responding in a language which just so happens to be their native language? 688 01:22:51,100 --> 01:22:53,530 Is it their language at home? Okay. 689 01:22:53,770 --> 01:23:02,200 That would be one key covariate that's potentially going to show up as language DIF when it's actually nothing to do with language at all. 690 01:23:02,740 --> 01:23:07,629 And so we didn't really have the sheer amount of data required to delve into that too far. 
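The logic of DIF in a Rasch-type model can be sketched in a few lines. This is only an illustration of the idea, not the TAM parameterization with covariates that the study actually used, and all names and values here are made up for the example:

```python
import math

def p_correct(theta, b, dif, focal):
    """Rasch probability of a correct response, with a language-DIF shift.

    theta: person ability; b: item difficulty for the reference group
    (e.g. English responders); dif: the item-by-language interaction;
    focal: 1 if the focal-language (e.g. French) version was sat, else 0.
    """
    return 1.0 / (1.0 + math.exp(-(theta - (b + dif * focal))))

# Two examinees with the SAME estimated ability: if dif != 0, the item
# is differentially difficult across the two language versions.
p_english = p_correct(theta=0.5, b=0.0, dif=0.6, focal=0)
p_french = p_correct(theta=0.5, b=0.0, dif=0.6, focal=1)
```

With `dif = 0.6` the French version is harder for an equally able examinee; that item-by-language shift, not any average difference in attainment, is what the study set out to predict from textual features.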
691 01:23:07,630 --> 01:23:15,490 But we did include some covariates like that, for example, whether the test language matches the home language, as well as gender, 692 01:23:15,640 --> 01:23:19,900 as well as a couple of other factors, to try to purify, so to speak, 693 01:23:20,080 --> 01:23:28,260 the DIF estimate that we had based on language. And so this gives us DIF as both a continuous, standardised outcome, 694 01:23:28,260 --> 01:23:35,340 but it can also be categorised into basically an effect-size measurement as well: negligible (small) DIF, moderate and large. 695 01:23:36,860 --> 01:23:43,610 And so I really don't want you to glaze over trying to read all of this, because I appreciate it's too much information, 696 01:23:43,610 --> 01:23:49,100 but from here on is the DIF that basically is considered to matter. 697 01:23:49,400 --> 01:23:53,870 And you can see here that, by and large, the percentages across the different subjects, 698 01:23:53,870 --> 01:23:59,300 across the different levels, across the different languages are pretty small. 699 01:23:59,510 --> 01:24:04,820 So this is a good news story for the IB, basically, that it seems they are doing well. 700 01:24:04,970 --> 01:24:08,930 And then I'll show you for 2019 as well; it's probably even better. 701 01:24:09,140 --> 01:24:14,150 So, by and large, it looks like they're doing a pretty good job of the translation. 702 01:24:14,450 --> 01:24:17,650 It's a bad news story for us, however, because, you know, 703 01:24:17,750 --> 01:24:23,440 machine learning wants lots of observations of different kinds of categories, and we don't have them here. 704 01:24:23,450 --> 01:24:32,810 So I'll come back to that at the end. So generally, the findings were very positive for the IB DP Science examinations. 
705 01:24:34,220 --> 01:24:39,260 And as I pointed out, unfortunately, because there are so few items showing moderate and large 706 01:24:39,330 --> 01:24:43,460 DIF, we do have this highly imbalanced dataset for the later class predictions. 707 01:24:45,810 --> 01:24:49,200 So, the second step is to establish textual complexity. 708 01:24:49,210 --> 01:24:52,620 So, we've got the outcome variable. Now we need the predictor variables. 709 01:24:53,010 --> 01:24:58,950 Now the way that we did that, as I said earlier, was using the textual analysis software READERBENCH. 710 01:24:59,280 --> 01:25:04,499 Any of you who know about NLP and the prediction of text complexity know that 711 01:25:04,500 --> 01:25:08,100 there's a strong overlap between READERBENCH and what's offered in Art 712 01:25:08,110 --> 01:25:10,530 Graesser and his colleagues' product, Coh-Metrix. 713 01:25:10,860 --> 01:25:17,280 It's just that READERBENCH has been generalised to more languages than Coh-Metrix has, and it's also completely open source as well. 714 01:25:17,280 --> 01:25:20,370 So it was helpful for us in that regard. 715 01:25:20,910 --> 01:25:24,989 And so I'm not going to go into too much of the technical detail here, 716 01:25:24,990 --> 01:25:30,930 but basically it pre-processes the text using fairly standard core NLP procedures: 717 01:25:31,200 --> 01:25:38,940 basically, tokenising the bits of the text, tagging the kinds of words, etc., fronted adverbials and whatnot, 718 01:25:39,450 --> 01:25:42,870 Michael Gove joke in there, and dependency parsing, 719 01:25:42,870 --> 01:25:50,040 which is to do with grammatical structures. And it also applies language-specific resources as well: 720 01:25:50,040 --> 01:25:56,909 language-specific corpora, the actual body of text that it's drawing upon, and the lexical ontologies and semantic models, 721 01:25:56,910 --> 01:26:00,810 which have to do with the sort of semantic modelling of relationships in the text as well. 
722 01:26:04,550 --> 01:26:11,480 So it provides you with a wide array of indices, over 300, for understanding textual complexity in multiple languages. 723 01:26:11,930 --> 01:26:16,100 And, you know, I think this is still an open empirical question, 724 01:26:16,100 --> 01:26:21,470 but at least they argue in their literature that it provides information that can be compared across languages. 725 01:26:22,130 --> 01:26:27,030 There are five categories in particular: surface, syntax, morphology, word and cohesion. 726 01:26:27,050 --> 01:26:31,910 I'm not going to go into defining all of those in great detail, but we can talk about it at the end, if you like. 727 01:26:33,380 --> 01:26:36,470 And this is how we got to our final set, basically. 728 01:26:36,590 --> 01:26:38,770 It's very standard practice in machine learning. 729 01:26:38,780 --> 01:26:44,510 You get rid of any indicators that don't have any variance, because they're not going to predict anything for you. 730 01:26:44,750 --> 01:26:50,420 We also got rid of any variables that showed more than a sort of trivial amount of missing data as well. 731 01:26:50,990 --> 01:26:55,100 And then we ended up with 151 indices at the end. 732 01:26:55,610 --> 01:26:59,020 So in terms of what the predictor variables are, if you can picture it: 733 01:26:59,030 --> 01:27:05,870 for each of the languages, we have text complexity indices for each of the items. 734 01:27:05,870 --> 01:27:08,300 So we have them in English, French and Spanish. 735 01:27:08,600 --> 01:27:14,810 Now we just did this in a kind of brute-force, slightly dumb way, where we just literally got all the text for a question, 736 01:27:15,290 --> 01:27:21,409 bundled it together and ran the analysis on it, which again is a point I'll come back to at the end. 737 01:27:21,410 --> 01:27:29,420 But it's a starting point. We have to start somewhere. 
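The filtering step just described, dropping zero-variance indices and indices with more than a trivial amount of missing data, can be sketched as below. This is an illustrative Python re-implementation, not the actual caret preprocessing used in the study, and the 5% missingness cut-off is an assumed value for the example:

```python
def filter_features(table, max_missing=0.05):
    """Keep only complexity indices that could actually predict something.

    table maps an index name to its list of values across items
    (None marks a missing value).
    """
    kept = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        if 1 - len(present) / len(values) > max_missing:
            continue  # more than a trivial amount of missing data
        if len(set(present)) <= 1:
            continue  # zero variance: identical for every item
        kept[name] = values
    return kept

indices = {
    "flat_index": [3.0, 3.0, 3.0, 3.0],     # no variance -> dropped
    "gappy_index": [1.2, None, None, 0.8],  # too much missing -> dropped
    "useful_index": [0.4, 1.1, 0.9, 2.3],   # kept
}
surviving = filter_features(indices)
```

The same screen, applied to the 300-plus READERBENCH indices, is the kind of step that leaves the final 151.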
And so the predictor variables then become the differences between the complexity indices, 738 01:27:29,420 --> 01:27:34,220 for English versus French, and for English versus Spanish. 739 01:27:34,450 --> 01:27:37,430 Okay. Because we have a DIF estimate for both of those pairings, 740 01:27:37,640 --> 01:27:41,990 English being the reference group and Spanish and French being the two focal groups. 741 01:27:43,390 --> 01:27:47,950 So this is just to show you that, across the different categories, 742 01:27:48,700 --> 01:27:54,310 the remaining indices still represented the various categories quite well. 743 01:27:56,740 --> 01:28:03,700 So, as I said before, a machine learning model is designed to identify optimal sets of features to predict an outcome variable. 744 01:28:04,300 --> 01:28:09,190 How do you do that? How do you build the model? It's based on the evaluation of different criteria: 745 01:28:09,250 --> 01:28:14,740 basically, whether there's lower prediction error or greater explanatory power. 746 01:28:15,220 --> 01:28:18,580 And in particular, I'd say what differentiates, say, 747 01:28:18,580 --> 01:28:25,899 machine learning or data science from just normal applications of statistics is that there's a lot of emphasis placed 748 01:28:25,900 --> 01:28:32,920 on cross-validation to avoid overfitting the data and to enhance the generalisability of the findings. 749 01:28:33,160 --> 01:28:39,430 And so if anyone ever tries to tell you that machine learning is something different to statistics, tell them they're lying, 750 01:28:39,430 --> 01:28:45,700 but just say, you know, there's a nuance in that: there's a much stronger emphasis on this topic of cross-validation. 751 01:28:46,690 --> 01:28:51,009 So how do we cross-validate things? Well, first and foremost, we have a training and test sample. 
752 01:28:51,010 --> 01:28:56,710 So we're going to build a model that makes a certain set of predictions, and that's going to be based on the training set of data. 753 01:28:56,830 --> 01:29:00,460 Then we get the final model, and then we apply it to a test set of data 754 01:29:00,550 --> 01:29:07,720 the model has never seen before, in order to get an unbiased view of whether the model performs well or not. 755 01:29:08,080 --> 01:29:11,980 And when we're training the data itself, we do something very similar, 756 01:29:12,070 --> 01:29:16,630 although a little bit more convoluted, by way of what's referred to as K-fold cross-validation: 757 01:29:17,950 --> 01:29:25,450 dividing the training data up into, in this case, five folds. Each time, you train it on four of the folds and test it on the fifth, 758 01:29:25,450 --> 01:29:30,160 right? You iterate that around, and then you get the final averages that give you the model performance. 759 01:29:32,160 --> 01:29:40,239 So. We were using those 151 indices, the different indices, plus the five categorical test characteristics. 760 01:29:40,240 --> 01:29:45,790 And we were looking at 990 standardised DIF estimates: 495 items across the two languages. 761 01:29:46,840 --> 01:29:49,950 What were we using? We used both random forest regression, 762 01:29:49,960 --> 01:29:57,930 so this was for when DIF was a continuous outcome, and also ordinal random forest regression for DIF as an ordinal category outcome. 763 01:29:57,940 --> 01:30:01,060 Remember what I said: the small, medium, large. 764 01:30:01,870 --> 01:30:09,090 Now I don't think I'm going to have time to teach you what random forest regression is, but it is actually a pretty cool idea, to be honest. 765 01:30:09,100 --> 01:30:12,190 It's referred to as an ensemble learning approach. 
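The K-fold scheme just described (five folds, train on four, hold out the fifth, rotate, average) can be sketched as follows. This is a minimal Python illustration of the resampling logic, not caret's actual implementation:

```python
def kfold_splits(n_items, k=5):
    """Yield (train, test) index lists: each fold is held out once
    while the model is fitted on the remaining k-1 folds."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, test

# Cross-validated performance is then the average of the metric
# (e.g. R-squared) over the k held-out folds.
splits = list(kfold_splits(n_items=10, k=5))
```

Because every item is held out exactly once, the averaged metric reflects performance on data the model was not fitted to, which is what guards against overfitting.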
766 01:30:12,190 --> 01:30:14,530 And the reason why is that what we have here, 767 01:30:14,530 --> 01:30:22,900 we have different decision trees, and these are set up such that they are necessarily going to be different solutions to the problem, 768 01:30:23,140 --> 01:30:26,620 which then essentially get averaged at the end to give you a prediction. 769 01:30:27,010 --> 01:30:33,160 The great thing about it: it's nonparametric, it's non-linear, it's able to model very complex interactions. 770 01:30:33,400 --> 01:30:39,430 It can easily deal with both categorical and continuous predictors, and it's robust against multicollinearity. 771 01:30:40,090 --> 01:30:43,180 The sorts of problems we routinely deal with in our world, 772 01:30:43,180 --> 01:30:49,230 especially when you've got 151 different text indices that are all only very subtly different from one another. 773 01:30:49,690 --> 01:30:54,130 And it's commonly used in machine learning circles, and it's somewhat interpretable. 774 01:30:57,300 --> 01:31:01,770 So I was going to teach you what a decision tree was, but let's see if this works. 775 01:31:02,310 --> 01:31:08,490 So essentially what you're doing is you're recursively partitioning the data in terms of the predictor 776 01:31:08,490 --> 01:31:15,300 variables so that you can purify the prediction in terms of the dependent variable as quickly as possible. 777 01:31:15,780 --> 01:31:21,460 So in this case, you've got two predictor variables, X and Y, right, at the top there. 778 01:31:21,480 --> 01:31:24,800 It's basically looking for the partition 779 01:31:24,810 --> 01:31:30,180 it can introduce where, at the next node, the groups are as purified as possible. 780 01:31:30,180 --> 01:31:36,600 So let's say we want to predict whether it's a cat or a dog based on two predictor variables you've got: 781 01:31:36,600 --> 01:31:40,559 does it have pointy ears, or is it white, and so on? 
782 01:31:40,560 --> 01:31:46,469 Basically, the algorithm will cycle through different levels of these predictor variables to essentially 783 01:31:46,470 --> 01:31:53,700 give you the initial split that will purify your beginning set of cats and dogs into something that's purer. 784 01:31:54,000 --> 01:31:57,930 So this one has more cats and this one has more dogs, and so on and so forth. 785 01:31:58,230 --> 01:32:05,670 And it stops doing that when it gets to a point where any further bifurcations don't add any information to the process. 786 01:32:07,690 --> 01:32:12,580 And so why random forest? Well, how many trees are you going to have? 787 01:32:12,610 --> 01:32:18,640 So we actually just fixed it to have a thousand trees. What it does then is that it creates new samples. 788 01:32:18,970 --> 01:32:24,610 So it takes the original sample and it creates what's referred to as a bootstrapped sample. 789 01:32:24,790 --> 01:32:28,300 So it randomly samples from the original sample with replacement. 790 01:32:28,540 --> 01:32:36,699 So the rows from the original sample might appear more than once in the bootstrap sample, and it will do that a thousand times with a thousand trees. 791 01:32:36,700 --> 01:32:44,980 So therefore every tree is unique. And then the second thing it does is it takes a different subset of predictor variables for each tree. 792 01:32:45,430 --> 01:32:47,200 And so how do you determine those numbers? 793 01:32:47,200 --> 01:32:53,530 Well, basically, again, you just brute-force it, and you say try every value and tell me which one works the best. 794 01:32:54,010 --> 01:33:00,750 So that's what we did. The results? As I said, they're not particularly compelling, but they're also not nothing. 795 01:33:00,760 --> 01:33:06,850 And to be honest, at the beginning of this research project, we were a little bit worried that this would just show up nothing. 
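The two ingredients just described, bootstrapped samples and purity-driven splits, can be sketched in a few lines. This is an illustrative toy in Python (using Gini impurity as one common purity measure), not the actual random forest implementation used in the study:

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows WITH replacement: some rows repeat, some are
    left out, so every one of the (here, a thousand) trees sees
    different data."""
    return [rng.choice(rows) for _ in rows]

def gini(labels):
    """Gini impurity: 0.0 for a perfectly 'pure' node (all cats or all dogs)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# A split is chosen because it purifies the node: a mixed set of cats
# and dogs at the top, purer groups below.
mixed = ["cat", "cat", "dog", "dog"]
pointy_ears = ["cat", "cat"]   # left child after splitting on pointy ears
floppy_ears = ["dog", "dog"]   # right child
```

The tree keeps splitting until further bifurcations no longer reduce impurity; the forest then averages the thousand trees' predictions.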
796 01:33:08,050 --> 01:33:15,880 And so you see here in the training set with the K-fold cross-validation, the R-squared ends up being about 0.12. 797 01:33:16,540 --> 01:33:22,730 And interestingly, when we then apply that to the test set, it's more like 0.20. 798 01:33:22,750 --> 01:33:30,790 So normally what you're looking for between training and test is that that number for test isn't too much lower than for the training. 799 01:33:30,790 --> 01:33:36,880 Something strange is going on here where the test performance is actually much better than the training performance. 800 01:33:37,240 --> 01:33:46,030 So somewhere between 12 and 20% of the variance in that continuous DIF outcome variable is being predicted by this model. 801 01:33:47,660 --> 01:33:51,530 Again, I don't want you to read all of this, other than to pay attention to the fact that, 802 01:33:51,530 --> 01:33:55,429 remember, the sort of basic characteristics of the items that were in the model, 803 01:33:55,430 --> 01:34:04,560 they don't appear here. All of these are NLP-based predictors. And then how does it determine whether something is important or not? 804 01:34:04,580 --> 01:34:12,530 Essentially, it pulls it out of all the models and says: how much of the explained variance do we lose based on pulling out that one predictor? 805 01:34:14,530 --> 01:34:18,249 In terms of predicting the ordinal outcome: 806 01:34:18,250 --> 01:34:26,980 again, not great, but not nothing. So we again have this improved performance of the model in the test data as opposed to the training data. 807 01:34:27,460 --> 01:34:33,640 Somewhere between 34 and 40% of the language DIF effect size categories were predicted. 808 01:34:33,640 --> 01:34:40,390 But remember, you would expect some successful predictions just based on chance alone, which is where the Kappa comes in. 809 01:34:40,780 --> 01:34:47,020 And by any sort of conventional standards, that's not a very good Kappa, but it's not nothing either.
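Two evaluation ideas come up in this passage: variable importance measured as the explained variance lost when one predictor is knocked out, and Cohen's Kappa as a chance-corrected hit rate for the ordinal outcome. A small sketch of both on synthetic data (not the study's data; scikit-learn's permutation importance shuffles a predictor rather than refitting without it, which is a common approximation of the same idea):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic continuous outcome driven by 2 of 5 predictors.
X, y = make_regression(n_samples=400, n_features=5, n_informative=2,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Importance = mean drop in R^2 on held-out data when one predictor's values
# are shuffled, breaking its link with the outcome.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(np.argsort(imp.importances_mean)[::-1])  # most important first

# Kappa: agreement between predicted and true ordinal categories,
# corrected for the agreement you would get by chance alone.
true_cats = [0, 0, 1, 1, 2, 2, 2, 1]
pred_cats = [0, 1, 1, 1, 2, 2, 0, 1]
print(round(cohen_kappa_score(true_cats, pred_cats), 2))  # → 0.62
```

In the toy Kappa example the raw agreement is 6/8 = 0.75, but chance agreement from the marginal category frequencies is about 0.34, so the chance-corrected value is noticeably lower, which is exactly why raw accuracy overstates the ordinal model's performance.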
810 01:34:49,820 --> 01:34:54,980 Again, these are all NLP variables that are doing the heavy lifting in the modelling. 811 01:34:56,340 --> 01:35:02,999 So the models perform somewhat poorly, but they still predicted something, especially for the continuous DIF outcome. 812 01:35:03,000 --> 01:35:12,360 And I want you to keep in mind that when we got the expert reviewers, right, based on the cApStAn model of translation quality control, 813 01:35:12,360 --> 01:35:20,400 and we created a survey out of that and got them to compare the items, I think on six different criteria, it predicted exactly zilch. 814 01:35:20,690 --> 01:35:25,020 Okay. So relative to that, I think this is doing okay. 815 01:35:25,740 --> 01:35:26,550 And also, 816 01:35:26,970 --> 01:35:35,370 you can see here that the predictors that were doing the heavy lifting do come from the automated natural language processing of the text. 817 01:35:37,200 --> 01:35:42,780 So, the future for this work. I personally think these findings show promise; you might disagree. 818 01:35:44,130 --> 01:35:50,020 What do I think we need to do to improve it? I think we need purer and larger estimates of DIF by language. 819 01:35:50,040 --> 01:35:56,069 What do I mean by purer? Basically, remember I said we tried to decontaminate 820 01:35:56,070 --> 01:35:59,550 the DIF estimates by way of having other covariates in the DIF model. 821 01:35:59,820 --> 01:36:06,930 Well, we could only do so much with that. So I think that we need to get other datasets which enable us to do that even more so. 822 01:36:06,940 --> 01:36:11,759 So for example, with international large-scale assessments, PISA, PIRLS, within a country, 823 01:36:11,760 --> 01:36:16,049 they'll have people doing the assessment in multiple languages.
824 01:36:16,050 --> 01:36:18,840 And it just enables you, once you build up a dataset like that, 825 01:36:18,840 --> 01:36:24,870 to partial out some of that variance attributable to other factors that contaminates the language DIF, so to speak. 826 01:36:25,560 --> 01:36:30,710 Also, is this ever really going to work that well for science? Well, science, you know, 827 01:36:31,380 --> 01:36:36,390 particularly when we're talking about physics at what we'd call a Year 12 level in Australia, 828 01:36:36,630 --> 01:36:44,760 there's a lot of mathematics and formulae in there. The same goes for chemistry, and biology is a little bit more language-heavy, but still. 829 01:36:45,330 --> 01:36:51,330 So probably this wasn't the best starting place to apply this; something like a reading assessment would probably have worked better. 830 01:36:51,870 --> 01:37:00,060 And also, coming back to this idea of us just sort of jumbling all the text together for each item and running the NLP analysis on that: 831 01:37:00,360 --> 01:37:07,530 it's not a very intelligent approach. We need a more nuanced approach to evaluating text complexity by item type. 832 01:37:07,890 --> 01:37:14,400 When you think about it, a multiple-choice item is a very, very complex linguistic entity. 833 01:37:14,970 --> 01:37:24,180 Okay? And just the sheer amount of text and its qualities only go so much of the way to explaining the complexity of the item, 834 01:37:24,390 --> 01:37:26,730 and the same goes for constructed response items. 835 01:37:28,050 --> 01:37:34,650 And as a final point, I come back to the sort of general premise of the whole research and whether it's valid or not. 836 01:37:34,650 --> 01:37:42,180 And I'm sure that some people in the room have ideas on this. So, is automatic cross-lingual text complexity assessment even possible? 837 01:37:42,810 --> 01:37:48,090 So we need to build models of text complexity on common corpora, rather than what ReaderBench does right now.
838 01:37:48,270 --> 01:37:53,850 Each of the languages is basing itself on a different corpus, which I don't think is very good. 839 01:37:55,050 --> 01:38:01,090 We need to build models of text complexity using new language-neutral tools for tagging and parsing, because, you know, 840 01:38:01,140 --> 01:38:07,590 as Art alluded to, the research world of NLP is making advances, like, every week. 841 01:38:07,600 --> 01:38:12,330 So, keeping up to date on what's happening there, 842 01:38:12,690 --> 01:38:18,890 validating the comparability of the complexity categories; these need to be psychometrically evaluated. 843 01:38:18,900 --> 01:38:23,850 So, for me, when I saw this, it screamed out to me: this is a factor analysis problem. 844 01:38:24,360 --> 01:38:32,519 The idea is that you have multiple indicators of these categories, that there are sort of latent categories underlying what the indices are about. 845 01:38:32,520 --> 01:38:37,590 And once you approach it that way, it also immediately leads to questions of, well, 846 01:38:37,740 --> 01:38:42,870 if you get a certain factor analysis solution for one language, do you get it across other languages? 847 01:38:42,870 --> 01:38:49,950 And you can actually start answering the question of whether these indices do work in an invariant way across languages. 848 01:38:50,820 --> 01:38:55,590 I'd say generally it's more plausible within major language families than across. 849 01:38:55,590 --> 01:39:04,110 So, you know, within Indo-European languages it might work okay, but not necessarily when you bring in other major language families. 850 01:39:04,440 --> 01:39:09,000 And generally it just remains a very interesting empirical question for us.
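The invariance question at the end can be sketched in a simple form: fit a factor model to the same text-complexity indices in each language group and compare the loading patterns. A proper test would use multi-group confirmatory factor analysis; the exploratory version below, on simulated indices with hypothetical language labels, is only meant to illustrate the logic:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

def simulate_indices(n, loadings):
    """Six text-complexity indices driven by one latent complexity factor."""
    factor = rng.normal(size=(n, 1))
    noise = 0.3 * rng.normal(size=(n, len(loadings)))
    return factor @ loadings[None, :] + noise

loadings = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
lang_a = simulate_indices(500, loadings)  # stand-in for language A items
lang_b = simulate_indices(500, loadings)  # stand-in for language B items

# One-factor solution per language; compare the two loading vectors.
def fit_loadings(X):
    return FactorAnalysis(n_components=1, random_state=0).fit(X).components_.ravel()

la, lb = fit_loadings(lang_a), fit_loadings(lang_b)
lb = lb * np.sign(la @ lb)  # loading signs are arbitrary; align them first

# Tucker congruence coefficient: near 1.0 suggests the indices pattern
# onto the latent complexity factor the same way in both languages.
congruence = la @ lb / (np.linalg.norm(la) * np.linalg.norm(lb))
print(round(congruence, 2))
```

Here the two groups are generated from the same loadings, so the congruence comes out near 1.0; with real indices across distant language families, a noticeably lower value would be evidence against invariance.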