1 00:00:06,890 --> 00:00:10,720 Dear all, my name is Therese Hopfenbeck. 2 00:00:10,790 --> 00:00:19,820 I'm the Director of the Oxford University Centre for Educational Assessment, and it gives me real pleasure to welcome you all here today. 3 00:00:20,720 --> 00:00:28,310 I would particularly like to welcome Professor Art Graesser from the US and Professor Sandra Milligan from Australia. 4 00:00:28,940 --> 00:00:36,980 We're extremely glad to have you here. And we also have our own Professor Josh McGrane, whom we're really proud to have as one of the speakers. 5 00:00:37,580 --> 00:00:40,630 I will say more about all of you when I introduce you, 6 00:00:41,480 --> 00:00:50,330 but I will say a particular welcome to all our guests from different universities, nationally and also internationally, 7 00:00:50,840 --> 00:00:58,430 because we have Professor Nils Gilje and Professor Sølvi Lillejord, who flew in from Norway just to be part of this celebration. 8 00:00:58,430 --> 00:01:05,120 And Sølvi has been a longstanding supporter of the centre and also a Fellow of our Department of Education. 9 00:01:05,120 --> 00:01:07,100 So we're really, really happy that you're here. 10 00:01:08,240 --> 00:01:18,170 We would also particularly like to say we're very pleased to have the President of Kellogg College, Jonathan Michie. Welcome. 11 00:01:18,260 --> 00:01:25,340 We're really pleased that you're here. And we also would like to say a special welcome to all our DPhil students and master's students, 12 00:01:25,700 --> 00:01:36,080 and recognised students, like Sarah from Australia, from OUCEA in our department, because you are what makes Oxford a very special place to be.
13 00:01:36,620 --> 00:01:41,570 And we are pleased to be offering face-to-face events, although a bit hot, 14 00:01:42,620 --> 00:01:47,479 because after the pandemic it's been a lot online, and I really hope that you will have 15 00:01:47,480 --> 00:01:52,790 time now in Oxford to thrive and meet each other in person and experience that too. 16 00:01:54,420 --> 00:02:00,390 It is a special welcome this year, as we are now celebrating the first OUCEA Annual Lecture since 2019, 17 00:02:01,290 --> 00:02:07,710 before the pandemic. Our centre, founded in 2008 with generous funding from Pearson, 18 00:02:08,100 --> 00:02:15,900 has been a small but thriving centre, and each year we have had the pleasure of hosting some of the world's leading professors to give a talk. 19 00:02:16,710 --> 00:02:25,320 Previous lectures have included David Andrich from Australia, Gordon Stobart, UCL, UK, Pam Sammons, Oxford, UK, Nancy Perry, 20 00:02:25,350 --> 00:02:32,250 British Columbia, Canada, David Pearson, Berkeley, US, and Derek Briggs from Colorado Boulder, US. 21 00:02:33,600 --> 00:02:37,700 Little did we know that we would have to wait two years to celebrate again. 22 00:02:38,810 --> 00:02:45,200 Therefore, we will allow ourselves to celebrate in appreciation of the funders who have believed in us through the pandemic. 23 00:02:45,710 --> 00:02:53,570 And we celebrate our national and international collaborators who stood by us and achieved so much with us in the past challenging years. 24 00:02:54,590 --> 00:03:01,610 I would like to thank you all. The pandemic and lockdown have been hard in different ways for all of us. 25 00:03:02,240 --> 00:03:07,520 And living with uncertainty in academia was a challenge we had to tackle. 26 00:03:08,720 --> 00:03:13,580 I'm proud to say we were able to get through it, although many will say we're still in it.
27 00:03:14,510 --> 00:03:19,310 And I would like to take a few moments to particularly thank the team at OUCEA for their commitment, 28 00:03:19,670 --> 00:03:26,600 dedication, good spirit and positivity, which led us to successfully secure grants, 29 00:03:26,870 --> 00:03:34,490 collect data, write research reports, publish articles, and present at different international conferences. 30 00:03:36,110 --> 00:03:42,440 I will use as an example that here today we have Dr. Samantha-Kaye Johnston present. 31 00:03:42,830 --> 00:03:48,260 She worked hard online for more than a year before we actually met in person. 32 00:03:48,410 --> 00:03:51,950 And she represents this kind of team spirit which I'm talking about. 33 00:03:53,460 --> 00:04:03,090 A particular thank you and appreciation go to Deputy Director Josh McGrane, who led OUCEA together with me during these challenging times. 34 00:04:03,960 --> 00:04:12,000 We have been able to find a way of working where collaboration, and making our team succeed no matter what, was the key to all we did. 35 00:04:12,810 --> 00:04:16,890 And you used the words 'leading with integrity and compassion', Josh. 36 00:04:17,850 --> 00:04:22,350 And that is what we do and will continue to do. 37 00:04:22,890 --> 00:04:27,170 And thank you for that. These are values I strongly believe in, 38 00:04:27,180 --> 00:04:32,100 and also one reason why Kellogg College became such an important part of our academic life. 39 00:04:32,880 --> 00:04:37,020 We are very thankful to have Kellogg President Jonathan Michie here today celebrating with us. 40 00:04:37,470 --> 00:04:43,050 Jonathan is now a Fellow of the Department of Education here in Oxford and an important collaborator for us, 41 00:04:43,350 --> 00:04:51,240 as both Josh and I are members of the college and have attended seminars and events to support its mission on sustainability and education. 42 00:04:51,960 --> 00:04:58,420 You can hardly think of anything more important.
And we will continue to support the college in any way we can. 43 00:04:59,190 --> 00:05:07,640 And for those of you visiting Oxford, if you do have time, walk to Kellogg and see the wonderful gardens, take a cup of tea or something in the Hub. 44 00:05:07,650 --> 00:05:14,730 It's beautiful there. But for now, let me introduce our distinguished guests. 45 00:05:14,760 --> 00:05:20,489 Art Graesser is a Professor Emeritus in the Department of Psychology and the Institute for Intelligent 46 00:05:20,490 --> 00:05:25,050 Systems at the University of Memphis and an Honorary Research Fellow at the University of Oxford. 47 00:05:25,830 --> 00:05:30,420 His research is in discourse processing, cognitive science and education. 48 00:05:30,720 --> 00:05:37,050 And personally, I had the pleasure of meeting you for the first time in Melbourne in 2010, when we were both working on PISA. 49 00:05:38,100 --> 00:05:41,370 He has developed software in learning, language and discourse technologies, 50 00:05:41,370 --> 00:05:46,550 including systems that hold a conversation in natural language with computer agents, AutoTutor, 51 00:05:46,570 --> 00:05:51,990 and that analyse text on multiple levels of language and discourse, Coh-Metrix. 52 00:05:52,830 --> 00:05:59,280 He served as editor of Discourse Processes and the Journal of Educational Psychology, as president of the Society for Text and Discourse, 53 00:05:59,280 --> 00:06:06,540 the International Society for Artificial Intelligence in Education and the Federation of Associations in Behavioral and Brain Sciences, 54 00:06:06,840 --> 00:06:13,290 and as a member of four panels with the National Academy of Sciences and four OECD expert panels on problem solving: 55 00:06:13,620 --> 00:06:18,140 PIAAC 2011-21, PISA 2012 and 2015.
56 00:06:18,840 --> 00:06:22,950 He has received lifetime achievement awards from the American Psychological Association, 57 00:06:22,950 --> 00:06:32,240 McGraw-Hill Education, the Society for Artificial Intelligence in Education, the Society for Text and Discourse, and the University of Memphis. Art, 58 00:06:33,030 --> 00:06:43,050 the floor is yours. And we're so happy to have you. So, 100-degree heat, Fahrenheit, right? 59 00:06:44,900 --> 00:06:51,260 Well, I come from Memphis, and we've had 100-degree heat now for six weeks. 60 00:06:52,190 --> 00:07:01,430 And so I'm used to it. If you need any, you know, sort of solutions on how to survive it, just ask me during the break. 61 00:07:03,780 --> 00:07:09,330 So I retired three years ago, September of 2019. 62 00:07:11,790 --> 00:07:19,290 September 2019. A month later, I found out my wife had endometrial cancer. 63 00:07:21,760 --> 00:07:32,280 About a couple of months after that, COVID came. The next city I was supposed to go to was a place called Wuhan, China. 64 00:07:33,180 --> 00:07:43,440 Have you ever heard of that? I couldn't go because of my wife's cancer, so I keep telling my wife her cancer saved me. 65 00:07:45,570 --> 00:07:52,020 So, some people think I'm never really going to retire. 66 00:07:52,050 --> 00:07:56,090 And in fact, I'm at a phase called transitioning. 67 00:07:58,200 --> 00:08:07,650 And so, during the transition, what I've been doing is participating in a number of large-scale projects. 68 00:08:08,490 --> 00:08:12,120 I'm not leading any of them. I'm just a team member. 69 00:08:12,720 --> 00:08:19,770 But at this phase in my life, I want to help these large-scale projects succeed. 70 00:08:20,340 --> 00:08:21,930 And what I want to do is just give you 71 00:08:23,240 --> 00:08:33,980 very succinct highlights on some of these. All of the work is current, so I'm not going to delve into the past. 72 00:08:35,120 --> 00:08:40,360 There are three important themes. Technology.
73 00:08:40,600 --> 00:08:44,260 Yes. Multiple disciplines. 74 00:08:44,710 --> 00:08:53,620 Not only being interdisciplinary, but inter-institutional and intercontinental, and not quite interstellar yet. 75 00:08:55,960 --> 00:09:04,970 And also 21st century skills, in addition to the maths and the literacy and science, 76 00:09:05,530 --> 00:09:10,720 because 21st century skills, as we know, are an important part of the workforce. 77 00:09:11,560 --> 00:09:16,810 And so we need to change our curriculum to accommodate that. 78 00:09:17,080 --> 00:09:20,620 And assessment is one way to drive that. 79 00:09:22,630 --> 00:09:29,350 This is a recent article I have in the Annual Review of Psychology. 80 00:09:31,090 --> 00:09:41,620 I was asked to cover educational psychology in 25 pages, so I decided to focus on this particular slant. 81 00:09:43,410 --> 00:09:51,470 So this is an overview, organised really around the funders. 82 00:09:53,390 --> 00:09:59,930 The first is the OECD projects. I assume everybody knows what PISA and PIAAC are. 83 00:10:00,350 --> 00:10:05,870 Does anybody not know what PISA is? Raise your hand if you don't know what PISA and PIAAC are. 84 00:10:08,450 --> 00:10:11,810 Well, we'll talk later. Yeah, right. So. 85 00:10:15,580 --> 00:10:24,960 Then I'm going to talk about funding for adult learning out of the Institute of Education Sciences of the US Department of Education. 86 00:10:24,970 --> 00:10:33,210 That's a major research arm of the US Department of Education, and it's all about adult learning. 87 00:10:33,280 --> 00:10:36,910 I know for decades they concentrated on K-12, 88 00:10:37,540 --> 00:10:46,899 but it's only recently that they're branching out to adult learning, because there are a lot 89 00:10:46,900 --> 00:10:55,570 of struggling adults who can't keep pace with the skills that are needed in the 21st century. 90 00:10:57,540 --> 00:11:00,779 Then a couple of things from the Department of Defence.
91 00:11:00,780 --> 00:11:05,819 The Department of Defence has really led a lot of the work on learning environments and 92 00:11:05,820 --> 00:11:11,670 assessment in those environments, because they deal with adults in the military. 93 00:11:11,910 --> 00:11:15,120 And so they've funded a large number of projects. 94 00:11:16,320 --> 00:11:23,310 And then finally, I want to talk a little bit about the NSF-funded Learning Data Institute. 95 00:11:25,250 --> 00:11:28,760 So let me start with the OECD projects. 96 00:11:30,250 --> 00:11:33,910 As you know, they have these assessments throughout the world. 97 00:11:34,150 --> 00:11:38,620 You have a number of countries who consistently buy into this. 98 00:11:39,070 --> 00:11:42,430 Others may buy in some years, but not others. 99 00:11:43,030 --> 00:11:46,390 And the countries have to pay for it. 100 00:11:48,580 --> 00:11:55,870 The whole concept is, of course, that the OECD is really interested in the economies of these countries. 101 00:11:56,410 --> 00:12:03,370 And education is a big part of predicting the success of economies. 102 00:12:04,030 --> 00:12:15,530 And assessment, hopefully, will drive the curriculum to improve the education of the people in these countries. 103 00:12:16,550 --> 00:12:22,910 As you know, there are these comparisons amongst countries, but you're not supposed to worry about that. 104 00:12:24,500 --> 00:12:32,180 Instead, it's supposed to be driving countries over time in their planning on education and curriculum. 105 00:12:35,450 --> 00:12:44,960 The project I've recently been involved in is the adaptive problem solving assessment for PIAAC 2021. 106 00:12:46,320 --> 00:12:54,030 Actually, the data are being collected in 2022, because of COVID. 107 00:12:57,540 --> 00:13:03,900 The key thing with adaptive problem solving is that the problems are not static.
108 00:13:05,440 --> 00:13:11,290 Traditionally in problem solving, you have a set of givens and then a goal state, 109 00:13:11,560 --> 00:13:18,490 and then you try to figure out how to solve the problem to get you from the givens state to the goal state. 110 00:13:18,730 --> 00:13:25,000 Right. Well, in adaptive problem solving, the problems change in the middle. 111 00:13:27,040 --> 00:13:32,380 And so you have to change your strategy. So strategy change is an important part. 112 00:13:34,220 --> 00:13:41,390 You have to have better metacognition in order to track what you know and how you deal with it. 113 00:13:42,920 --> 00:13:46,370 And so metacognition is a big part of the assessment. 114 00:13:49,190 --> 00:13:53,360 So they're in the middle of collecting data on this. 115 00:13:54,170 --> 00:14:06,950 So I can't really show you any data. But Samuel Greiff at Luxembourg, he and his group are leading the charge. 116 00:14:07,340 --> 00:14:12,560 And then Educational Testing Service is playing the big role in collecting the data. 117 00:14:14,140 --> 00:14:19,990 And so maybe at some point I can talk a little bit about that in the future. 118 00:14:21,520 --> 00:14:24,820 The other thing I wanted to mention is 119 00:14:28,000 --> 00:14:35,930 a couple of projects led by Stuart Elliot. 120 00:14:37,000 --> 00:14:49,570 He's been part of the OECD. He's also affiliated with the National Academies of Sciences, Engineering, and Medicine in the United States. 121 00:14:50,030 --> 00:14:53,620 So what are these reports about? 122 00:14:54,610 --> 00:15:03,460 Well, everybody's worried about people not having the right skills in the future. 123 00:15:04,600 --> 00:15:10,960 And they're also worried about AI and robotics taking over jobs.
124 00:15:12,220 --> 00:15:20,740 And so one of the projects that I participated in had a bunch of experts, 125 00:15:20,740 --> 00:15:38,470 11 of us, all knowledgeable about AI and robotics, consider a computer taking the tests on literacy, numeracy and problem solving, 126 00:15:39,190 --> 00:15:43,000 and imagine how well the computers would do. 127 00:15:44,360 --> 00:15:54,620 And so each of us, you know, tried to make judgements on all the items and justify our decisions. 128 00:15:55,670 --> 00:16:02,780 And you might have a problem like this windmill problem, which is a 129 00:16:02,780 --> 00:16:08,360 problem asking how many windmills would be needed to replace the nuclear reactor. 130 00:16:08,810 --> 00:16:12,230 And so you have to read the text and understand it. 131 00:16:12,470 --> 00:16:16,130 You have to sometimes look at pictures and interpret them. 132 00:16:16,430 --> 00:16:24,319 You have to understand a table of data that's down below, and you have to integrate all that 133 00:16:24,320 --> 00:16:28,010 in order to interpret the question and answer the question. 134 00:16:28,730 --> 00:16:33,560 And so you had 11 experts do this, and they analysed the data. 135 00:16:34,490 --> 00:16:40,490 They divided the experts into the optimists, the realists and the pessimists. 136 00:16:42,020 --> 00:16:45,680 Well, I was categorised as a pessimist. 137 00:16:46,910 --> 00:16:58,010 Now, we recently redid this, about four months ago, and the same experts and a few more came and analysed the data. 138 00:16:58,010 --> 00:17:01,070 And I'm now a realist. 139 00:17:01,520 --> 00:17:09,620 So I guess there's been progress: either I'm understanding AI better, or AI has changed. 140 00:17:09,980 --> 00:17:14,090 But it's an interesting project, and you can read about it. 141 00:17:15,810 --> 00:17:20,580 The other project that Stuart Elliot is leading is.
142 00:17:22,300 --> 00:17:29,560 Trying to figure out what skills matter. Well, first of all, what is a good taxonomy of skills? 143 00:17:30,670 --> 00:17:38,530 And so he got together people in psychology, psychometrics, people certifying people for jobs. 144 00:17:40,630 --> 00:17:44,470 He also got people in AI and human factors. 145 00:17:44,650 --> 00:17:47,140 They got an ensemble of people. 146 00:17:47,650 --> 00:17:58,630 I was fortunate to be asked to be one of them, and there's a book that reflects on what they think a good taxonomy of skills is. 147 00:17:59,500 --> 00:18:12,910 And in addition to identifying that taxonomy and the skills, they are making judgements on what computers can do versus people. 148 00:18:13,800 --> 00:18:17,340 What should people be doing, as opposed to computers? 149 00:18:18,120 --> 00:18:27,060 And these are judgements by experts. No real hard data, but kind of interesting, and a lot of interesting perspectives. 150 00:18:27,510 --> 00:18:30,899 You can download these from the OECD website. 151 00:18:30,900 --> 00:18:37,470 As you know, the website has, I guess, thousands of documents that you can download. 152 00:18:37,650 --> 00:18:40,650 Some of them you get for free, others you have to pay for now. 153 00:18:40,650 --> 00:18:44,310 But it's fascinating, because 154 00:18:45,460 --> 00:18:55,690 many people have argued that in the future we want to better understand what the computer should do versus what humans should do. 155 00:18:56,870 --> 00:19:00,230 And there are a lot of counterintuitive findings. 156 00:19:02,480 --> 00:19:07,190 And exploring that is important. 157 00:19:07,430 --> 00:19:09,340 And this is really the first volume. 158 00:19:09,390 --> 00:19:15,590 They're going to have other volumes in the future, because this is a pressing question that people have. 159 00:19:15,860 --> 00:19:19,280 A lot of people are terrified that they're going to lose their jobs.
160 00:19:19,640 --> 00:19:30,530 And as you know, there's been a shift towards more jobs requiring reasoning and problem solving and collaboration, collaborative problem solving. 161 00:19:30,830 --> 00:19:36,920 I know Sandra and I were just talking about collaborative problem solving, and I was involved in that in PISA. 162 00:19:38,370 --> 00:19:44,810 So what do you want the computer to do, as opposed to humans? 163 00:19:46,470 --> 00:19:51,330 So this is just another problem. 164 00:19:52,720 --> 00:19:57,130 And, you know, you look at all the information here. 165 00:19:57,580 --> 00:20:03,400 A computer would have to identify which columns to get the information from. 166 00:20:03,850 --> 00:20:08,860 It's asking which muscles would benefit most if you use the gym bench. 167 00:20:09,370 --> 00:20:18,180 And so you have to look under muscles, and then you have to drill down on the right cell in order to answer the question. 168 00:20:18,190 --> 00:20:21,970 And there's a lot of distracting information, and you have to handle that. 169 00:20:24,210 --> 00:20:33,390 Okay. Let me go on to the second part, with the projects with the Institute of Education Sciences and adult learning. 170 00:20:35,010 --> 00:20:43,920 I was fortunate to be part of a large centre grant, the Centre for the Study of Adult Literacy. 171 00:20:44,760 --> 00:20:49,080 That was led by Daphne Greenberg at Georgia State University. 172 00:20:49,110 --> 00:20:57,630 She's really one of the gurus who've been trying to help improve adult literacy throughout her career. 173 00:20:59,160 --> 00:21:05,250 Those are the Co-PIs, and that project was a five-year project. 174 00:21:05,280 --> 00:21:09,150 It's now continuing under the direction of John Sabatini. 175 00:21:09,810 --> 00:21:14,040 We're fortunate that John Sabatini 176 00:21:14,490 --> 00:21:20,910 joined us at the University of Memphis.
He used to be at Educational Testing Service, and in fact, 177 00:21:20,910 --> 00:21:35,040 he was the PI on the assessment of comprehension skills on a very large $100 million investment from the Institute of Education Sciences. 178 00:21:35,790 --> 00:21:37,830 And so he knows a lot about assessment. 179 00:21:38,100 --> 00:21:48,180 But he wanted to break away from ETS in order to do more research on learning and combining learning with assessment. 180 00:21:49,240 --> 00:21:56,709 Both of us believe that you can't really track learning without good assessment, and you 181 00:21:56,710 --> 00:22:03,400 can't just have assessment without considering learning. And I see people nodding; 182 00:22:03,400 --> 00:22:10,270 I think you all agree. And so, fortunately, he came to the University of Memphis. 183 00:22:12,510 --> 00:22:19,360 So we had an intervention, and it was a hybrid intervention to treat 184 00:22:22,350 --> 00:22:29,850 reading skills, and especially comprehension skills, because 40 to 60 million people, 185 00:22:30,120 --> 00:22:35,790 adults in the United States, don't read at a deep enough level to get a decent job. 186 00:22:36,690 --> 00:22:45,430 They don't read at an eighth-grade level, for example. And in fact, we've collected data in colleges with psychometric tests. 187 00:22:45,840 --> 00:22:52,860 And 38% of the students in college don't read at an eighth-grade level. 188 00:22:53,640 --> 00:22:56,640 And so, you know, it's a problem. 189 00:22:56,640 --> 00:23:01,770 And that squares, I think, with the PIAAC and PISA data, too. 190 00:23:03,120 --> 00:23:11,760 Now, it's hard to train teachers how to teach comprehension skills. 191 00:23:12,990 --> 00:23:18,690 They do very well on decoding and vocabulary, but when it comes to comprehension, 192 00:23:18,690 --> 00:23:28,680 it's very variable and too often not science-based, without even tangible ways to teach comprehension.
193 00:23:29,160 --> 00:23:35,760 It's hard. So what we did is build a system, AutoTutor, with these agents. 194 00:23:36,390 --> 00:23:46,830 And so imagine people reading text on a computer and having these agents having conversations about the text with the person. 195 00:23:47,550 --> 00:23:57,720 And imagine periodically asking conversation-based questions, just asking a question and seeing how they answer. 196 00:23:59,520 --> 00:24:03,059 Well, that's what we did with AutoTutor. Now, 197 00:24:03,060 --> 00:24:14,610 it was not natural language input, because writing is hard and typing is hard, and often they don't have the digital literacy skills. 198 00:24:15,450 --> 00:24:28,409 And in fact, both John Sabatini and I are convinced you have to combine digital skill acquisition with reading comprehension. 199 00:24:28,410 --> 00:24:39,330 Both have to go hand in hand. One challenge is, if you look at digital skill acquisition on the computer, 200 00:24:40,820 --> 00:24:47,440 nearly all of it requires you to know how to read. Well, that's the problem you're trying to solve. 201 00:24:47,450 --> 00:24:51,740 So it's a real chicken-and-egg problem, a real challenge. 202 00:24:51,950 --> 00:25:00,460 And so we're now building digital skills training for people where you don't have to read much, you know? 203 00:25:03,410 --> 00:25:08,180 So they have these agents, and we created a lot of lessons. 204 00:25:08,780 --> 00:25:12,139 Actually, we have 30 lessons. 205 00:25:12,140 --> 00:25:24,590 Some of them are on words and sentences, like how to interpret pronouns or non-literal language, or learning new words. 206 00:25:25,280 --> 00:25:33,220 Then you have others on stories and texts, in many different kinds of genres and subgenres. 207 00:25:33,230 --> 00:25:45,590 You get narrative and expository and persuasive texts, and compare-and-contrast, problem-solution, and each of these modules.
208 00:25:46,040 --> 00:25:50,380 It takes 20 minutes to an hour to complete. 209 00:25:50,900 --> 00:25:56,300 And so it's really a 20-to-30-hour intervention that they have. 210 00:25:57,140 --> 00:26:06,379 And so we built this system, and we were hoping not only to help the students learn better, 211 00:26:06,380 --> 00:26:15,770 but also instructors, because they're often not exposed to training for comprehension skills. 212 00:26:16,100 --> 00:26:23,090 Remember, in literacy centres, a lot of these people are volunteers from the community. 213 00:26:24,220 --> 00:26:35,260 And in the past, until recently, there was not a lot of investment from the US government to train adults in reading comprehension skills. 214 00:26:35,740 --> 00:26:43,780 And so both instructors can learn, and also struggling adult readers can learn. 215 00:26:44,670 --> 00:26:55,540 So this was a hybrid intervention, with human instructors as well as AutoTutor on the computer. 216 00:26:56,490 --> 00:27:01,020 However, our part was the AutoTutor part. 217 00:27:02,220 --> 00:27:05,160 AutoTutor can store all these data in the cloud. 218 00:27:06,480 --> 00:27:18,180 Whereas with humans, you don't really know what they're doing unless you had a very large budget to videotape them and transcribe it and analyse it. 219 00:27:18,420 --> 00:27:25,380 That would take forever, of course, whereas with the computer, it's all in the cloud; you can immediately see it. 220 00:27:25,770 --> 00:27:27,659 And there are many things you can collect. 221 00:27:27,660 --> 00:27:37,080 You can collect reading time, how long it takes to read the text; accuracy in answering these conversation-based questions; 222 00:27:37,320 --> 00:27:47,400 the time it takes to answer the question; any learner initiative in asking questions or choosing lessons to take. 223 00:27:49,350 --> 00:27:58,050 There are also measures of the students.
You can collect background measures, and it's all stored there in the database, in the cloud. 224 00:27:59,250 --> 00:28:09,930 We have psychometric tests, three of them, on comprehension skills and other skills, and we analysed the lessons. 225 00:28:09,930 --> 00:28:17,220 We used this Coh-Metrix system to scale texts on difficulty and many levels of language. 226 00:28:19,140 --> 00:28:24,740 And so, lots of data. 227 00:28:25,190 --> 00:28:34,190 And so we've been mining it. In fact, next week I go to Educational Data Mining and Artificial Intelligence in Education. 228 00:28:34,200 --> 00:28:38,300 It's held in Durham. It's the first time I'll ever have been in Durham. 229 00:28:38,810 --> 00:28:42,740 And so we'll be presenting some of our stuff there. 230 00:28:45,220 --> 00:28:54,880 We did a study with about 253 people; these came out of Toronto and Atlanta, 231 00:28:55,690 --> 00:29:01,750 and these were struggling adult readers, below the eighth-grade level. 232 00:29:02,440 --> 00:29:06,250 Typically, they read between the third- and fourth-grade level. 233 00:29:06,820 --> 00:29:19,030 Okay. So we analysed the data from AutoTutor, looking at the pattern 234 00:29:19,030 --> 00:29:26,620 of reading times and the time to answer questions and things like that. 235 00:29:27,670 --> 00:29:33,250 We did some clustering analysis, and we found four types. 236 00:29:33,490 --> 00:29:41,470 You have the higher performers. These are the ones that answer questions relatively quickly and are pretty correct. 237 00:29:41,950 --> 00:29:51,390 And then you have the conscientious, and they're the real beneficiaries, looking at effect sizes on post-test minus pre-test. 238 00:29:52,900 --> 00:29:56,080 So you really want to leave them alone to do whatever they're doing. 239 00:29:56,710 --> 00:30:02,140 Then you have struggling readers. It's beyond the zone of proximal development for them.
240 00:30:02,650 --> 00:30:06,600 And our intervention was not working for them at all. 241 00:30:06,690 --> 00:30:10,960 You've got to do something different. You've got to scrap it and do something different. 242 00:30:11,380 --> 00:30:18,220 And then the under-engaged. Some people try to do things quickly, and so they underachieve. 243 00:30:18,220 --> 00:30:23,200 And so you might ask them to have a deeper read, things like that. 244 00:30:24,330 --> 00:30:35,670 So what we're trying to do is mine the data now so we can have a recommender system, in order to guide the learner on what to do next. 245 00:30:38,040 --> 00:30:42,070 Let me shift now to the Department of Defence projects. 246 00:30:44,370 --> 00:30:49,259 Once again, the DOD, the 247 00:30:49,260 --> 00:31:00,000 US Department of Defence, has taken the lead over the last 30 years in building these intelligent tutoring systems 248 00:31:01,680 --> 00:31:07,350 and other advanced learning environments, including virtual reality and augmented reality, 249 00:31:07,890 --> 00:31:17,550 and concentrating on adults. And so we've been involved in projects with them over the last 30 years. 250 00:31:18,450 --> 00:31:21,540 Let me just mention a couple of them. 251 00:31:21,900 --> 00:31:33,540 One is, during the last decade, we have been developing this Generalized Intelligent Framework for Tutoring. 252 00:31:34,320 --> 00:31:38,440 It's called GIFT, for short. 253 00:31:39,250 --> 00:31:45,910 And each year we gather a bunch of experts on some aspect of building these systems. 254 00:31:46,720 --> 00:31:54,940 And we get together, there's a presentation, and we write a book, and you can get the book for free. 255 00:31:55,840 --> 00:31:59,110 If you go to gifttutoring.org 256 00:32:01,120 --> 00:32:12,720 And it's really a good snapshot if you want to find out what the latest is on intelligent tutoring systems, and that involves team training, too.
257 00:32:14,350 --> 00:32:15,820 You can go to this site. 258 00:32:16,180 --> 00:32:29,770 Over 300 experts throughout the world have contributed to the GIFT expert workshops, and they represent all sorts of advanced learning environments, 259 00:32:29,770 --> 00:32:36,760 not just intelligent tutoring systems. And so the first year, their focus was on learner modelling. 260 00:32:37,780 --> 00:32:46,719 Then on instructional strategies, then on authoring tools, then domain knowledge, then assessment. 261 00:32:46,720 --> 00:32:53,770 The assessment one was held at Educational Testing Service. Then one on teams. 262 00:32:54,580 --> 00:33:02,350 We had people involved in collaborative problem solving there. Then self-improving systems. 263 00:33:02,350 --> 00:33:14,230 These are systems where the computer improves during the evolution of building a system, as well as learners improving and educators improving. 264 00:33:14,920 --> 00:33:27,040 It's kind of a co-evolving learning on all parts. Then data visualisation, and recently competency-based scenario design. 265 00:33:27,040 --> 00:33:28,869 So you're free to go there. 266 00:33:28,870 --> 00:33:42,220 This was a team-up between the University of Memphis and the Army, and so we're really happy with that community of people building the systems. 267 00:33:42,670 --> 00:33:48,370 Traditionally, these systems have been expensive to build, very expensive. 268 00:33:48,370 --> 00:33:55,810 So part of the goal is to make them able to be created faster, cheaper, things like that. 269 00:33:58,410 --> 00:34:01,440 Now you might ask, why did they team with Memphis? 270 00:34:01,980 --> 00:34:10,920 That's because we're known in our Institute for Intelligent Systems for building a lot of these systems with intelligent agents, 271 00:34:11,430 --> 00:34:17,009 these conversational agents. Some of them hold conversations in natural language.
272 00:34:17,010 --> 00:34:25,590 So you try to interpret what people say in natural language and respond to get them to productively learn. 273 00:34:26,310 --> 00:34:37,560 Now, I will say this: the computer does not perfectly understand the human. Most humans don't perfectly understand other humans. 274 00:34:38,040 --> 00:34:41,940 So you want to compare them on that basis. 275 00:34:42,030 --> 00:34:46,650 But the agents can have dialogue moves to help people learn. 276 00:34:46,830 --> 00:34:47,700 That's the idea. 277 00:34:49,800 --> 00:34:58,350 I'm going to talk about ElectronixTutor briefly, but the other one we like is the Personal Assistant for Lifelong Learning. 278 00:34:58,860 --> 00:35:05,870 Imagine if you had an agent that followed you throughout your career and collected all this information. 279 00:35:05,880 --> 00:35:10,260 It's almost like a learning portfolio, a digital learning portfolio. 280 00:35:10,680 --> 00:35:14,220 Imagine that. And then it comes up with recommendations. 281 00:35:14,760 --> 00:35:22,020 And the idea is, if you have to be an expert in a topic, it will give you a problem of the day. 282 00:35:22,980 --> 00:35:30,600 And that problem tries to patch misconceptions you might have had, or advance your skill. 283 00:35:31,050 --> 00:35:35,130 So imagine a 20-minute-a-day refresher. 284 00:35:35,370 --> 00:35:38,580 So you stay current as an expert in your field. 285 00:35:39,300 --> 00:35:43,680 And that helps people prevent skill decay. 286 00:35:44,430 --> 00:35:52,770 That's a big problem, especially if you have a class where they throw all this information at you for 40 hours, 287 00:35:53,400 --> 00:36:01,470 distributed over whatever period of time. And if you look at people six months later, a lot of it is forgotten. 288 00:36:02,190 --> 00:36:07,020 So the goal is to have some activity to keep up your skills. 
289 00:36:07,380 --> 00:36:18,900 And we actually prevented skill decay in an area of electronics, which is very important in the Department of Defence. 290 00:36:20,930 --> 00:36:25,570 This is one of the projects I led, and it's still being used. 291 00:36:25,580 --> 00:36:39,860 ElectronixTutor is actually being used in nuclear power command training for people who want to go into the Navy in that area. 292 00:36:40,460 --> 00:36:51,680 That's not nuclear bombs, by the way. It's nuclear power to energise the ships. And so lots of institutions were involved: 293 00:36:52,390 --> 00:36:56,150 the University of Southern California; Memphis was the lead. 294 00:36:56,480 --> 00:37:07,700 And then we had people from Arizona State University, WPI; we had industry involved, from Raytheon, and so on. 295 00:37:08,360 --> 00:37:15,170 This is the most co-authors I've ever had on an article, because I wanted everybody included. 296 00:37:15,920 --> 00:37:22,640 And you've probably heard physics papers have 500 authors. 297 00:37:22,850 --> 00:37:27,170 I felt I was drifting towards that issue. 298 00:37:28,640 --> 00:37:32,920 But imagine. 299 00:37:34,340 --> 00:37:40,720 Really, a federation of intelligent tutoring systems and learning environments. 300 00:37:41,560 --> 00:37:49,780 Imagine just reading texts like the Navy documents, or AutoTutor, conversational. 301 00:37:50,620 --> 00:38:03,820 And then you have LearnForm, doing reasoning with mathematics, or assessments that train people on things like Kirchhoff's Law and Ohm's Law, very mathematical. 302 00:38:04,450 --> 00:38:13,690 And then you have Dragoon, a deep sort of mental-models approach to electronics, very deep simulation sort of stuff. 303 00:38:13,990 --> 00:38:20,620 So it's a collection of intelligent tutoring systems, all integrated in one environment. 304 00:38:21,790 --> 00:38:25,840 And you track all the activities, stored in the cloud. 
305 00:38:26,260 --> 00:38:40,000 And you kind of see, hopefully, the long-term vision is to see what people learn and what environments they choose to use. 306 00:38:40,450 --> 00:38:43,870 Do they want a visual or a simulation? 307 00:38:44,170 --> 00:38:54,940 Of course, we already know people try to avoid hard work, and so they often don't choose things wisely if they do it in a self-regulated way. 308 00:38:57,080 --> 00:39:09,660 But you can have a recommender system to guide them, and that may help them stay on a good path. 309 00:39:11,940 --> 00:39:24,600 You might ask about the measures. Well, you collect a lot of measures, including affect; you can track emotions and engagement. 310 00:39:26,430 --> 00:39:32,310 Excuse me. The hot weather is getting to me. You can track how much initiative they take. 311 00:39:35,070 --> 00:39:38,310 You can see if they follow your recommendations. 312 00:39:38,610 --> 00:39:47,600 And all of that is stored in the cloud. I just want to make one point here. 313 00:39:49,230 --> 00:39:52,500 A lot of what we do involves natural language interpretation. 314 00:39:53,870 --> 00:39:58,640 And what we do is see how much experts agree with each other. 315 00:39:59,620 --> 00:40:04,060 And then we see how much the computer agrees with the experts. 316 00:40:04,990 --> 00:40:13,960 And then what we do is we see how much the computer's analysis of natural language can mimic an expert. 317 00:40:14,260 --> 00:40:24,150 And we find we're almost there. Whether you take a stringent criterion on interpreting the things or a lenient one, we're almost there. 318 00:40:24,160 --> 00:40:28,660 And this is in electronics. We've found it in other topics as well. 319 00:40:28,930 --> 00:40:38,410 So really, assessing natural language and how well it compares to expected information, 320 00:40:38,860 --> 00:40:42,310 really, the field has advanced to being pretty good. 
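The comparison described here, how much experts agree with each other versus how much the computer agrees with an expert, can be sketched with standard inter-rater statistics. The following is a hypothetical illustration only, not the project's actual code: the labels and data are invented, and percent agreement with Cohen's kappa stands in for whatever stringent or lenient criteria the team actually used.

```python
# Hypothetical sketch: is computer-expert agreement approaching expert-expert
# agreement? All data and function names here are invented for illustration.
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    po = percent_agreement(a, b)          # observed agreement
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    # expected agreement if each rater labelled at their marginal rates
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Toy labels: 1 = "answer matches the expected information", 0 = it doesn't.
expert1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
expert2 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
computer = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

human_human = cohens_kappa(expert1, expert2)
computer_human = cohens_kappa(expert1, computer)
# "Almost there" in the talk's sense would mean computer_human approaching
# human_human, the ceiling set by humans themselves.
print(human_human, computer_human)
```

The point of chance correction is that humans largely agreeing is the realistic ceiling: a computer cannot be expected to agree with an expert more consistently than two experts agree with each other.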
321 00:40:43,320 --> 00:40:53,250 The last thing is the National Science Foundation Learning Data Institute, and we've been working on this during the last three years. 322 00:40:55,050 --> 00:41:05,940 The National Science Foundation is in the middle of building these larger centres, usually for about $20 to $25 million for a five-year period. 323 00:41:06,150 --> 00:41:08,670 And they have to involve many institutions. 324 00:41:09,000 --> 00:41:20,340 And we've been involved in the beginning part of one of these, and we have a lot of groups that have to kind of 325 00:41:22,340 --> 00:41:32,420 collaborate and analyse data. As you know, data comes in many forms and formats, and it's distributed all over the place. 326 00:41:33,230 --> 00:41:40,760 You need good data science on how to cobble together all those separate databases, 327 00:41:41,270 --> 00:41:49,070 and then you need to figure out how to apply advanced quantitative methods to analyse the data, 328 00:41:50,990 --> 00:41:57,770 whether it's machine learning or educational data mining methods, AI, whatever they are, 329 00:41:58,130 --> 00:42:05,990 sophisticated psychometrics, whatever it is, and to get more people involved as a community doing this. 330 00:42:06,830 --> 00:42:16,760 So at this point, I'm involved with all these large groups of people, leading none of them. 331 00:42:17,480 --> 00:42:24,980 And at some point, I will be transitioning to a full retirement. 332 00:42:25,310 --> 00:42:31,670 And I just don't know when that will be. My wife does have her views on that. 333 00:42:32,240 --> 00:42:45,670 Well, thank you so much. The next speaker, who will give some reflections on this topic, is Enterprise Professor Sandra Milligan, 334 00:42:46,100 --> 00:42:51,730 who is Director of the Assessment Research Centre at the Melbourne Graduate School of Education, University of Melbourne. 
335 00:42:52,390 --> 00:42:57,790 And Sandra has an unusually wide engagement across education, industry and research. 336 00:42:58,390 --> 00:43:06,160 Originally a teacher of science and mathematics, she's also a former Director of Curriculum in an Australian state education department and has 337 00:43:06,160 --> 00:43:11,170 held senior research, management and governance positions in a range of educational organisations, 338 00:43:11,620 --> 00:43:19,090 including government agencies, not-for-profits, small start-up businesses and large listed international corporations. 339 00:43:19,930 --> 00:43:26,560 Sandra's current research interests focus on assessment, recognition and warranting of hard-to-assess learning. 340 00:43:27,220 --> 00:43:34,270 She directs several research partnerships with school networks and organisations working to develop learner profiles for their students. 341 00:43:35,080 --> 00:43:42,969 She is lead author of the Future-Proofing Australian Students with New Credentials report, outlining methods to reliably 342 00:43:42,970 --> 00:43:49,570 assess and recognise the level of attainment of general capabilities, and of Recognition of Learning Success 343 00:43:49,690 --> 00:43:58,120 for All: Ensuring Trust and Utility in a New Approach to Recognition of Learning in Senior Secondary Education in Australia. 344 00:43:58,770 --> 00:44:02,319 So Sandra, we're pleased to welcome you. 345 00:44:02,320 --> 00:44:10,690 The floor is yours. And look, first of all, can I just say it is a huge honour to be here. 346 00:44:11,050 --> 00:44:16,840 I just love being here in Oxford and at the Centre for Educational Assessment. 347 00:44:17,110 --> 00:44:22,629 It is a really well-known organisation globally, but also in Australia. 
348 00:44:22,630 --> 00:44:30,880 So my associations are with Therese and Josh, and before that with David Andrich, who is a great mentor of mine. 349 00:44:31,270 --> 00:44:41,919 Well, he may not claim to have mentored me, but I claim that he mentored me, so he might not want to take the blame. 350 00:44:41,920 --> 00:44:51,520 But I've long had this centre in my imagination, and to be here today celebrating, I'm very honoured. 351 00:44:51,520 --> 00:45:01,630 So thank you very much. Actually, I wanted to take off from the final question that Art finished with, 352 00:45:02,170 --> 00:45:18,520 and that is to do with the real importance of aligning what we value with what we teach, with what we want learnt, with what we assess. 353 00:45:19,300 --> 00:45:26,800 So I'm going to be talking about that today, and how we might do that in the context of the work that is 354 00:45:26,800 --> 00:45:36,910 assisting me, and possibly other Australians, in assessment and certification in senior secondary education. 355 00:45:37,390 --> 00:45:47,900 Now, I've been watching the papers on the COVID fallout on the A-levels. 356 00:45:47,920 --> 00:45:50,920 You know, do we need these exams? Should we have exams? 357 00:45:50,920 --> 00:45:54,760 Do they do the right thing? Is that what we should be doing? 358 00:45:55,000 --> 00:46:02,650 Those issues are being mirrored everywhere, at least in the Anglosphere, certainly in Australia. 359 00:46:03,790 --> 00:46:10,420 In the Australian context, people, including my colleagues at the University of Melbourne, 360 00:46:11,350 --> 00:46:21,030 which we say is the Oxford of Australia, or possibly the Harvard, depending on which country we're talking about. 361 00:46:22,150 --> 00:46:27,430 So my colleagues who run the medical schools, the engineering schools, 362 00:46:27,850 --> 00:46:34,210 what they're saying is the senior secondary examination and certification system is not doing its work. 
363 00:46:35,050 --> 00:46:41,650 It's not giving us the people who are going to make the great doctors, who will love being engineers. 364 00:46:41,980 --> 00:46:48,100 It's giving us the high scorers on examinations, and that's not who we need. 365 00:46:48,850 --> 00:46:54,819 Some of the people who are high scorers are not going to make great doctors, and 366 00:46:54,820 --> 00:47:00,910 some people who are low scorers could be brilliant engineers if given the chance, 367 00:47:01,030 --> 00:47:09,220 etc. In addition, in our senior secondary schools, we still can't get 100% of people finishing. 368 00:47:09,520 --> 00:47:12,910 Still, 15 to 20% of students don't finish. 369 00:47:13,360 --> 00:47:19,690 That's terrible. Of those who do finish, a fair few of them truant. 370 00:47:20,440 --> 00:47:22,750 Don't turn up, don't care. 371 00:47:23,440 --> 00:47:35,140 Others sit conscientiously in the class, sending themselves into stress spirals and asking themselves, is this the pinnacle 372 00:47:35,320 --> 00:47:38,410 of learning, this thing that I'm doing? 373 00:47:38,650 --> 00:47:42,430 Do you see what I mean? Now, I'm not saying Australian education is terrible. 374 00:47:43,180 --> 00:47:48,880 I think it's great. But what I am saying is that there's room for improvement. 375 00:47:49,390 --> 00:47:55,150 And the general consensus in Australia is that there is room for improvement. 376 00:47:55,480 --> 00:47:59,020 We've had a number of reports over the last ten years or so. 377 00:47:59,020 --> 00:48:07,180 Those from Australia will recognise them: the Gonski reports, the Shergold reports, the review of the AQF. 378 00:48:07,810 --> 00:48:11,590 They're all saying we need to do something different, guys. 379 00:48:12,730 --> 00:48:18,580 And one of the reasons we don't is because of the assessment system. 
380 00:48:19,210 --> 00:48:34,210 So I agree with Art that assessment can be an absolute killer of really good education, but it can also be the thing that can shift things 381 00:48:34,420 --> 00:48:40,989 so it's better. So I want to tell you a little bit about the work that we're doing in senior 382 00:48:40,990 --> 00:48:47,469 secondary education to shift to what we value, so that we get full engagement, 383 00:48:47,470 --> 00:48:52,780 enthusiastic engagement, of all students in senior secondary. 384 00:48:53,200 --> 00:49:08,320 That's the goal. Let me just say something about the basis on which I'm going to make some comments, and that is, as Therese said, I'm an Enterprise Professor. 385 00:49:08,500 --> 00:49:13,900 I don't know whether you know what that means. I head an Enterprise Unit. 386 00:49:14,260 --> 00:49:17,590 What it actually means is that I have to get my own money. 387 00:49:18,280 --> 00:49:22,780 The university doesn't support us in that regard. 388 00:49:22,810 --> 00:49:27,910 In fact, we pay a tax, really, to be part of the university. 389 00:49:28,180 --> 00:49:33,370 I think that's possibly the way university research centres are going. 390 00:49:34,030 --> 00:49:40,240 And I am measured not only on revenue but on impact. 391 00:49:41,020 --> 00:49:46,870 So I have to show that the work in our centre is actually making a difference. 392 00:49:47,410 --> 00:49:55,510 The way we do that is that we problem-solve for industry folk, most particularly in the area of assessment. 393 00:49:55,930 --> 00:50:02,740 We're very fortunate, because we get people who are passionate about education, 394 00:50:03,370 --> 00:50:09,190 who hate the assessment system, who believe that there's a better system. 395 00:50:09,520 --> 00:50:14,710 They come to us and say, will you help us reform assessment so that we can meet our goals? 396 00:50:15,220 --> 00:50:17,980 They are what I call first movers. 
397 00:50:18,430 --> 00:50:28,780 So we are totally privileged to work with dedicated people who want to shift the system, and it's our job to do the R&D for them. 398 00:50:29,320 --> 00:50:33,730 What a pleasure. So, okay, what R&D are we doing? 399 00:50:33,760 --> 00:50:35,830 I'm going to go pretty quickly through this. 400 00:50:36,250 --> 00:50:44,680 And I'm assuming that these slides can be distributed, and if there's anything else you want to follow up on, I'm happy to chat. 401 00:50:45,280 --> 00:50:57,460 But this is it. This is what our first movers in senior secondary have been telling us over the last decade is the objective. 402 00:50:58,180 --> 00:51:04,780 They want to get beyond just knowledge. 403 00:51:05,740 --> 00:51:10,720 They want to get to the point where students have agency in what they're doing. 404 00:51:10,730 --> 00:51:13,270 In other words, they learn what they think is important. 405 00:51:13,900 --> 00:51:23,650 They want to get beyond knowledge to competence, so that learners not only know, but know how to do things. 406 00:51:23,890 --> 00:51:29,710 This is a big shift. This is a big shift. 407 00:51:30,010 --> 00:51:34,330 So that's what they want. That little sort of, what is it, 408 00:51:34,660 --> 00:51:46,210 pentagon of hexagons? They're the categories of learning ambition that we've abstracted from our first movers. 409 00:51:46,420 --> 00:51:50,830 And you can see, I don't know if, no, I haven't got a pointer. 410 00:51:51,070 --> 00:51:56,230 You can see that two of those on the left-hand side are fairly normal: knowledge and know-how. 411 00:51:56,240 --> 00:52:02,810 That's the curriculum as we know it. And basic literacies, that's the curriculum as we know it. 412 00:52:03,160 --> 00:52:05,860 But they're wanting to add those other three bits. 
413 00:52:06,400 --> 00:52:18,790 They want to teach learners to manage their own learning, not just be drilled to death and coached to the nth degree to pass the exams. 414 00:52:19,570 --> 00:52:30,880 They want to teach students how to connect to sustaining communities that they will need to be part of to thrive. 415 00:52:30,940 --> 00:52:34,720 Thrive is the key word. So that's communities 416 00:52:34,830 --> 00:52:40,860 of scholarship, communities of work, communities of culture, communities of community, 417 00:52:41,280 --> 00:52:48,000 so they want to teach students how to embed themselves in those sustaining communities. 418 00:52:48,240 --> 00:52:50,790 And they want to give them the learning staples, 419 00:52:50,790 --> 00:53:04,290 so 21st-century skills or general capabilities, because that's the skill set that enables students to build depth and breadth in their learning. 420 00:53:04,440 --> 00:53:09,660 That's what they want to do. And they come to us and say, well, that's what we want to do. 421 00:53:09,990 --> 00:53:13,650 Can you fix the assessment system so that we can do it? 422 00:53:14,190 --> 00:53:28,980 All right. The first step, of course, is to actually define these learning ambitions in ways that make them teachable, learnable and assessable. 423 00:53:29,520 --> 00:53:38,610 It's very easy for people to drift off into things that aren't really curriculum objectives. 424 00:53:38,760 --> 00:53:47,520 For instance, I mean, I'll probably make enemies by saying this, but if you take the issue of student well-being, 425 00:53:48,660 --> 00:53:59,130 I mean, all sorts of things can cause a student to have a very poor sense of self-esteem or well-being at any given time. 426 00:53:59,640 --> 00:54:03,750 It's unlikely that a school can guarantee the 427 00:54:05,980 --> 00:54:08,420 well-being of every student. 
428 00:54:08,440 --> 00:54:20,110 But they can guarantee that they are teaching the skills that students will need, that will enable them to thrive. 429 00:54:20,440 --> 00:54:34,900 So we spend a lot of time trying to focus people on defining what it is they're actually wanting to teach, that students can learn, and that is assessable. 430 00:54:35,200 --> 00:54:45,350 So it's important that this is curriculum-based, not a generalised-happiness sort of thing. 431 00:54:45,390 --> 00:54:48,130 Do you see what I mean? I'm not expressing that very well. 432 00:54:50,260 --> 00:54:57,070 For instance, when it comes to agency in learning, I won't go through this, but we've worked through: 433 00:54:57,430 --> 00:55:09,640 what is that? What are the skill features that you need if you're going to actually establish a learning environment that will give students agency? 434 00:55:09,670 --> 00:55:21,190 So we've done a detailed analysis of many of these skills, so that we understand what they are as objects of teaching and learning. 435 00:55:22,540 --> 00:55:31,149 I've just put this up here quickly. I probably will deny having shown you this, because this is a framework that the Federal, 436 00:55:31,150 --> 00:55:41,130 our Commonwealth Government, is currently consulting on to be the basic framework for post-school, 437 00:55:41,980 --> 00:55:52,130 post-compulsory senior secondary education, the national framework that will capture what those curriculum elements are. 438 00:55:52,180 --> 00:55:57,490 I mean, it's a very interesting document. It's not policy yet, 439 00:55:57,970 --> 00:56:03,430 but I'm hoping this, or something like it, will soon be policy in Australia, and it 440 00:56:03,430 --> 00:56:08,200 says these are the skills that you need, as well as the knowledge and the know-how, 441 00:56:09,400 --> 00:56:19,660 if you're a school leaver. At that point, most teachers freak out 
442 00:56:21,350 --> 00:56:28,940 and they say, oh, well, it's all very well to have a nice framework, but how do you assess those things? 443 00:56:30,020 --> 00:56:39,710 And so we've put a lot of effort into getting teachers to the point where they feel confident and capable in assessing these things. 444 00:56:39,920 --> 00:56:45,080 And you know, this is my favourite sort of picture, from Dreyfus. 445 00:56:45,410 --> 00:56:49,460 You might remember the Dreyfus taxonomy of competence. 446 00:56:49,880 --> 00:56:56,990 And so what we've discovered is that for each of these general capabilities or 21st-century skills, 447 00:56:57,350 --> 00:57:08,060 you can define a progression with five or six big leaps in the constellation of skills that people have. 448 00:57:08,390 --> 00:57:16,490 And so the challenge for teachers is just to understand those big qualitative leaps 449 00:57:16,970 --> 00:57:29,770 in competence that students go through, and then learn what to look for to place them at those positions. In our final ranking in Australia, 450 00:57:30,050 --> 00:57:34,010 we've got a thousand-point scale. 451 00:57:34,100 --> 00:57:40,970 It's called the ATAR, on which every senior secondary graduate is placed. 452 00:57:41,000 --> 00:57:44,090 If you don't believe me, ask the other Australians in the room. 453 00:57:46,310 --> 00:57:54,050 And so when we say, hmm, we think that there are five levels, not a thousand levels, 454 00:57:54,350 --> 00:57:58,700 people think we're being radical. Okay, it's sort of interesting. 455 00:58:00,020 --> 00:58:07,010 This is an example of one of those progressions that we've used. 456 00:58:07,250 --> 00:58:11,270 This was one we used in the Philippines. 457 00:58:11,690 --> 00:58:18,440 It's one on collaboration. It's sort of like collaborative problem solving. 458 00:58:18,920 --> 00:58:24,860 But the key thing for a teacher is that they recognise those levels. 
459 00:58:24,890 --> 00:58:31,550 They recognise that at the lowest level, a student needs to be told what to do with others. 460 00:58:32,750 --> 00:58:44,120 And then in the middle, they get quite used to the idea that they're working with a group, and now fit in and go with the flow. At the top level, 461 00:58:44,510 --> 00:58:52,640 they're the people who organise the environment, and engage and motivate others to collaborate. 462 00:58:53,030 --> 00:59:01,490 And once you explain that, and explain the behavioural indicators that go with each of those, 463 00:59:01,760 --> 00:59:06,320 we find that teachers fairly soon relax and say, I can do that. 464 00:59:07,040 --> 00:59:10,220 And this sort of fear factor disappears. 465 00:59:12,890 --> 00:59:16,970 To assess it, though, you cannot use examinations. 466 00:59:17,000 --> 00:59:21,860 These are not cognitive skills. You cannot use multiple-choice questions. 467 00:59:22,400 --> 00:59:34,940 You can't really use stuff that's written down. It needs to be performance-based, with multiple raters, multiple performances, 468 00:59:35,420 --> 00:59:49,850 so that there can be a human-based judgement of the level of attainment from the performance of a student. 469 00:59:50,390 --> 00:59:54,700 So getting that organised in a school is a non-trivial task. 470 00:59:54,850 --> 01:00:01,670 I'm forever happy to talk about this ad nauseam, but I'm going to go pretty quickly here. 471 01:00:02,000 --> 01:00:03,500 Now, this is the coup de grâce. 472 01:00:04,490 --> 01:00:19,220 This is what a senior secondary certificate currently is, and this exists in Australia to capture the attainments of a student. 473 01:00:19,940 --> 01:00:24,020 What you will notice about it, first of all, is that it's a digital document. 474 01:00:25,730 --> 01:00:32,570 It's got links all through it. So you can see that there are links to statements by Abbie. 
475 01:00:33,080 --> 01:00:40,070 You can see that it's got links to her portfolio, where you can see all the authentic performances that she's performed. 476 01:00:40,820 --> 01:00:46,850 The chrysanthemum in the middle. We call it a chrysanthemum because no one knows how to spell that. 477 01:00:46,850 --> 01:00:53,300 So it makes sense. Right? But it's a rose graph, for those who are technically minded. 478 01:00:53,690 --> 01:00:59,960 We call it a chrysanthemum, and that is a standards-based, 479 01:01:01,490 --> 01:01:15,049 highly psychometrically reliable, judgement-based assessment of the level of attainment of Abbie in the five areas that she was assessed on, 480 01:01:15,050 --> 01:01:21,020 which were quantitative reasoning, knowing how to learn, communication and whatever. 481 01:01:21,680 --> 01:01:31,009 And this is currently being used in 20 schools in Australia and a number in the US, and has now been accepted 482 01:01:31,010 --> 01:01:40,760 by 50% of our Australian universities as a legitimate alternative to the senior secondary assessments. 483 01:01:42,170 --> 01:01:45,530 The kids love it. Everyone loves the chrysanthemum. 484 01:01:46,010 --> 01:01:57,410 But most importantly, you can use the metrics underpinning the chrysanthemum to do the sorts of things 485 01:01:57,440 --> 01:02:06,080 that would enable matching of student attainment to opportunities that exist. Like, you might get 99 in the ATAR, 486 01:02:06,470 --> 01:02:11,720 but if you've got no empathy, there's no point in going into medicine, etc. 487 01:02:11,730 --> 01:02:17,140 No, that's probably not true, but never mind. 488 01:02:22,820 --> 01:02:32,660 So we're currently working with one of the biggest university selector groups, UAC, the Universities Admissions Centre, 489 01:02:33,500 --> 01:02:42,830 working out what sorts of uses this could have in university selection. 
490 01:02:42,950 --> 01:02:46,580 This is really interesting stuff, really interesting stuff. 491 01:02:47,750 --> 01:02:50,440 And it's not just this big-picture one. 492 01:02:50,450 --> 01:03:03,530 We're involved in a range of new credentialing projects where first movers in Australia are being given support by the University of Melbourne. 493 01:03:03,680 --> 01:03:13,190 The University of Melbourne is actually underwriting the quality of the credentials, so that we can move this along a bit further. 494 01:03:13,760 --> 01:03:17,900 I'm going to finish now, but I would just say this. 495 01:03:19,250 --> 01:03:27,620 From the point of view of being the Director of the Assessment Research Centre, and all those who travel in it, in the centre, 496 01:03:28,070 --> 01:03:40,570 we see this sort of activity as fundamental. I call it D and R, a development-and-research activity. 497 01:03:40,580 --> 01:03:51,740 It's not traditional research. It's actually trying to solve a problem, coming up with a solution, working out the science that's underneath it, 498 01:03:52,250 --> 01:03:57,920 and then packing the research evidence around it and altering it where it doesn't stack up. 499 01:03:58,490 --> 01:04:05,899 These are the kinds of empirical questions that we're working on with our first 500 01:04:05,900 --> 01:04:12,980 movers, and with the very interested observation of the policy people in Australia. 501 01:04:13,520 --> 01:04:28,010 And I'm hopeful that fairly soon we'll see learner profiles, based on things that we value that will help students thrive, 502 01:04:28,640 --> 01:04:37,460 become the lingua franca of the senior secondary certification system, to help us match people to opportunities. 503 01:04:37,760 --> 01:04:40,820 And it will be brilliant. So there you go. Thank you very much. 504 01:04:46,180 --> 01:04:49,450 Thank you so much. That was extremely fascinating. 
505 01:04:49,450 --> 01:04:59,440 So now I think we need to listen to a little bit about evaluating a method for predicting the 506 01:04:59,500 --> 01:05:04,540 comparability of International Baccalaureate cross-lingual assessments by combining psychometrics, 507 01:05:04,870 --> 01:05:09,189 machine learning and natural language processing, from Josh McGrane, 508 01:05:09,190 --> 01:05:15,790 our Deputy Director of the Oxford University Centre for Educational Assessment, a Senior Research Fellow and a Fellow of Kellogg College. 509 01:05:16,390 --> 01:05:22,540 He completed his university-medal-winning PhD in quantitative psychology at the University of Sydney. 510 01:05:23,500 --> 01:05:29,520 He has also been a postdoctoral fellow at the University of Western Australia and worked as a psychometrician 511 01:05:29,740 --> 01:05:36,970 for the Centre for Education Statistics and Evaluation in the New South Wales Department of Education. 512 01:05:38,200 --> 01:05:43,420 As a result, he has extensive experience in academic and government contexts in educational assessment, 513 01:05:43,810 --> 01:05:49,720 including being an expert adviser for several national and international government and non-profit organisations. 514 01:05:50,410 --> 01:05:56,530 His research and teaching interests span the philosophical, political and statistical aspects of psychometrics. 515 01:05:57,160 --> 01:06:06,190 He's presently working on several projects, including the development of a better framework for measurement across the physical and social sciences, 516 01:06:06,580 --> 01:06:12,340 combining psychometrics and AI modelling to predict and explain bias in educational assessments, 517 01:06:12,760 --> 01:06:20,860 educating critical thinking amongst secondary students internationally, and the development of language assessments for understudied languages. 518 01:06:21,400 --> 01:06:28,780 So, Josh, the floor is yours. Thank you, Therese. 
519 01:06:28,780 --> 01:06:35,019 So I'm now very much convinced that the weather gods of Great Britain hate me, and 520 01:06:35,020 --> 01:06:39,459 they are vengeful, because I've now lived here for going on six years and, most days, 521 01:06:39,460 --> 01:06:43,870 just like the people I've worked with, have complained about the weather. And what they've done is 522 01:06:44,140 --> 01:06:48,910 they've turned around, on the one day when it's supposed to be our last hurrah and celebration, and 523 01:06:48,910 --> 01:06:53,350 given me a day of Australian weather, so that half of the British people didn't show up today. 524 01:06:53,350 --> 01:06:57,100 So thank you for your vengeful irony in that regard. 525 01:06:57,640 --> 01:07:01,330 Now, I just want to say thank you to the two speakers that have come before me. 526 01:07:01,330 --> 01:07:08,229 They've both been wonderful talks and, funnily enough, there's a lot of overlap with what I'm about to say, in sort of nuanced ways. 527 01:07:08,230 --> 01:07:13,209 But I will say that we're going from sort of big-thinking, big-picture ideas to a very, 528 01:07:13,210 --> 01:07:18,790 very small-scale research study that really is only getting underway now. 529 01:07:19,000 --> 01:07:26,020 In fact, I'd say we're about two metres into the 100-metre sprint, and we probably stumbled a little bit out of the blocks. 530 01:07:26,020 --> 01:07:31,860 Okay, so spoiler alert, the findings aren't that amazing, but I wanted to talk about something more. 531 01:07:31,870 --> 01:07:35,049 We're talking about the future of assessment, the future of research. 532 01:07:35,050 --> 01:07:42,400 Well, often in these kinds of talks, you get people basically reflecting on things they've been doing for the last ten years. 533 01:07:42,400 --> 01:07:46,270 That makes it seem all simple and neat and linear, and it all goes together. 
534 01:07:46,540 --> 01:07:49,749 Well, I can assure you, when you're on the ground actually taking risks, 535 01:07:49,750 --> 01:07:55,329 having no idea whether these small but big ideas are going to work out, that it's not that neat. 536 01:07:55,330 --> 01:07:59,440 And I think that's an important lesson for the students in the room to learn in particular. 537 01:07:59,890 --> 01:08:04,870 And on that note, I just want to thank the people who have made the effort to come today despite the heat. 538 01:08:04,870 --> 01:08:08,320 So, Jonathan Michie, the president of Kellogg College, thank you for coming along. 539 01:08:09,160 --> 01:08:14,290 I see several of my colleagues in the room as well, so thank you for making the effort too: members of our centre, 540 01:08:14,290 --> 01:08:21,040 both present and former, and friends and colleagues and what have you. 541 01:08:21,040 --> 01:08:26,650 So thank you to everyone for coming along and listening to me talk about the rather nerdy stuff that you're about to hear. 542 01:08:27,040 --> 01:08:32,120 So. Just to give you some context: 543 01:08:32,140 --> 01:08:38,950 we were contracted by the International Baccalaureate Organisation to do a study looking at whether their translation 544 01:08:38,950 --> 01:08:45,850 processes were working well in their Diploma Programme examinations, and specifically within their science programme. 545 01:08:45,890 --> 01:08:53,290 Interestingly, the Diploma Programme qualification, I see, has a lot of overlap with what you were showing 546 01:08:53,290 --> 01:08:58,390 that the Australian Federal Department is considering putting in as actual policy now. 547 01:08:58,420 --> 01:09:02,110 So I hope that those two bodies are in a lot of conversation with one another. 
548 01:09:03,520 --> 01:09:08,680 But one unique aspect of the International Baccalaureate that isn't present in 549 01:09:09,040 --> 01:09:13,959 many national systems is the fact that you have to make your assessments valid, 550 01:09:13,960 --> 01:09:18,690 comparable and unbiased across several languages, in this case many languages, 551 01:09:18,700 --> 01:09:22,810 but in this particular research we would be looking at three of them. 552 01:09:23,320 --> 01:09:26,950 And so normally, when asking this sort of question, 553 01:09:27,850 --> 01:09:33,879 are examinations comparable across languages, we'll be doing the typical things we do within assessment: first and 554 01:09:33,880 --> 01:09:45,430 foremost, we'll be having a look at the translation processes and evaluating whether they meet best practice and what have you. 555 01:09:46,410 --> 01:09:54,120 Also, we might have actual human experts look at the different exam scripts and make judgements about whether, 556 01:09:54,120 --> 01:09:59,090 based on nuanced differences in the language that they see, 557 01:09:59,100 --> 01:10:04,710 you might predict there to be differences in the performance of the items across those languages. 558 01:10:04,980 --> 01:10:09,320 Now, as part of this broader research project, we did look at both of those questions. 559 01:10:09,330 --> 01:10:11,399 Yasmine, sitting in the crowd here, in particular 560 01:10:11,400 --> 01:10:17,250 did, but I'm not really going to be touching on that at all, just given the amount of time that I have available to me. 561 01:10:17,640 --> 01:10:22,530 And so what I am going to be focusing on is the part of the research study that I was driving, 562 01:10:23,190 --> 01:10:28,379 which was looking at: can you actually use AI, essentially, to predict differences 563 01:10:28,380 --> 01:10:33,180 in the performance of items across multiple languages for high-stakes assessment? 
564 01:10:34,930 --> 01:10:40,690 Now that's an awful mouthful to get through as a title, so I thought I'd better come up with another one. 565 01:10:40,930 --> 01:10:45,340 And for those of you familiar with the work of Stanley Kubrick and Dr. Strangelove, 566 01:10:45,730 --> 01:10:50,140 an alternative title would be: how I learnt to stop worrying and love statistics (and AI). 567 01:10:50,320 --> 01:10:57,520 And now, of course, I have my tongue firmly in my cheek here, because we have a quote from Borsboom, 568 01:10:57,520 --> 01:11:05,259 from the University of Amsterdam, and one of his colleagues, where he refers to the psychological test, or in our context 569 01:11:05,260 --> 01:11:09,880 we might think of it more broadly as the educational assessment, as being the atomic bomb. 570 01:11:09,940 --> 01:11:13,780 Now, interestingly, we had atomic bombs come up in one of the earlier speeches. 571 01:11:14,110 --> 01:11:17,380 Now, clearly, he's using a degree of hyperbole there. 572 01:11:17,590 --> 01:11:23,319 But what he's really getting at is that this is the one thing that assessment, both 573 01:11:23,320 --> 01:11:30,340 psychological and educational, has given the world that has actually radically changed global power structures. 574 01:11:30,700 --> 01:11:34,510 And so, moving into the future of assessment, me as a psychometrician, 575 01:11:34,900 --> 01:11:44,260 I know how much power I am often asked to imbue in assessments via those magical terms: is it valid and reliable? 576 01:11:44,480 --> 01:11:52,660 Okay, I put it through my statistical models, spit out the answers at the other end and say, yes, it's kind of valid, it's kind of reliable. 577 01:11:52,900 --> 01:11:57,100 And then you say it's valid and reliable, and go forth and change people's lives with it. 
578 01:11:57,130 --> 01:12:04,050 So for me, I just want to make clear that although I'm mashing up a lot of contentious areas here, 579 01:12:04,060 --> 01:12:07,390 psychometrics in itself can be quite contentious and reductive, 580 01:12:07,570 --> 01:12:13,479 along with machine learning and all the perils of AI that I'm sure most of us in the room are aware of, 581 01:12:13,480 --> 01:12:19,510 I hope that it is done with consideration of the actual ethical implications involved. 582 01:12:19,750 --> 01:12:25,540 I hope that this is a fairly benign application, but I'm happy to hear feedback on that. 583 01:12:26,740 --> 01:12:31,510 But I would just take a quick segue from the actual topic, because I think it is important, 584 01:12:31,510 --> 01:12:39,159 given that this is the last annual lecture that will be hosted by Therese and me, that I do acknowledge the host herself. 585 01:12:39,160 --> 01:12:44,860 And first and foremost, I should have Sam's photo up here as well, because they both organised the event today. 586 01:12:44,860 --> 01:12:46,450 So thank you on that behalf. 587 01:12:47,140 --> 01:12:53,550 But Therese is someone who spends a lot of time acknowledging others and very much diverts acknowledgement away from herself. 588 01:12:53,560 --> 01:12:57,400 So I am probably embarrassing the crap out of her right now, but that's okay. 589 01:12:57,430 --> 01:13:03,520 It's something that I want to do. Therese has spoken about what 590 01:13:03,520 --> 01:13:07,870 we've achieved together and collectively as a centre over the last five years. 591 01:13:07,870 --> 01:13:12,910 And I think it's a testament to both us as a team, but particularly her as a leader, 592 01:13:13,270 --> 01:13:18,930 that a lot of that has happened in the last two and a half years, and the last two and a half years have been very difficult. 593 01:13:18,970 --> 01:13:22,629 I'm not sure if anyone has noticed, you know, on several, several levels. 
594 01:13:22,630 --> 01:13:30,190 And so I just wanted to say to you, Therese: you brought up the topics of compassion and integrity, 595 01:13:30,460 --> 01:13:34,930 and I could not think of a leader who embodies those two principles more than you do. 596 01:13:34,930 --> 01:13:39,159 So it's my pleasure to have you as a leader. My pleasure to have you as a colleague. 597 01:13:39,160 --> 01:13:41,170 And my pleasure to have you as a friend. So thank you. 598 01:13:48,900 --> 01:13:56,940 Also, just as it takes a whole village to raise a child, it takes a whole research team to do a set of involved research projects like this. 599 01:13:57,330 --> 01:14:01,830 And so I just wanted to give a shout-out here to Yasmine El-Masri, who's in the crowd today. 600 01:14:01,830 --> 01:14:08,070 As I said, she actually conducted a large part of this broader research study that I'm not going to be talking about today. 601 01:14:08,700 --> 01:14:16,970 And also, you know, it's her work and her connection to Art Graesser and the work of Coh-Metrix that was part of the kernel of the idea in the first place. 602 01:14:16,980 --> 01:14:23,940 So I just wanted to acknowledge that too. Kit Double, another former post-doc of ours, who's fortunate enough to already be back in Australia, 603 01:14:26,250 --> 01:14:32,850 who, you know, did a lot of the dogsbody work of having to code and clean and what have you, which is the unsexy part of research, 604 01:14:32,850 --> 01:14:38,130 but it's extremely important, and people don't really acknowledge just how important it is often enough. 605 01:14:39,300 --> 01:14:48,330 Heather Kayton, who basically has been my right-hand person, I guess I should say, throughout this particular aspect of the project in particular. 
606 01:14:48,330 --> 01:14:54,180 And I'm really thankful for the support and also the expertise, given her background in applied linguistics, 607 01:14:54,180 --> 01:15:02,370 helping me to really understand what all this natural language processing blah blah actually means in terms of real language stuff. 608 01:15:03,300 --> 01:15:09,540 So she has really, really helped me along the way, and I hope that you'll continue with this. 609 01:15:10,530 --> 01:15:16,950 And then finally, Rebecca Hamer, who was the project coordinator from the IB side of things. 610 01:15:17,400 --> 01:15:18,750 She's based in The Hague. 611 01:15:18,750 --> 01:15:26,160 And I must admit, she did bust my balls a lot throughout this project, but sometimes that is required from a project manager. 612 01:15:26,310 --> 01:15:36,380 But she also helped an awful lot too, in terms of providing data, processing it and providing support by way of expert judges from the IB as well. 613 01:15:36,390 --> 01:15:41,040 So just a shout-out to her, because I do appreciate the input that she had to the project. 614 01:15:41,460 --> 01:15:48,840 Now, really returning to the actual substance of the talk, and just to create a little bit of synchronicity with Sandra: 615 01:15:49,410 --> 01:15:57,180 one last acknowledgement is a mentor of mine, and I think he would actually acknowledge that I was mentored by him, or at least I hope so. 616 01:15:57,180 --> 01:16:00,210 He was a reference for my job interview, so I guess he said nice things. 617 01:16:02,010 --> 01:16:11,459 Where this line of research is coming from, for me, is David Andrich's view of Rasch's measurement theory, and that's the 618 01:16:11,460 --> 01:16:18,210 idea that we have these psychometric models that we apply because they have good and desirable measurement principles. 
619 01:16:18,480 --> 01:16:22,290 And when we find problems, what he often refers to as anomalies, 620 01:16:22,410 --> 01:16:29,190 it's not an indication that we should go and get a different, fancier, better, more uninterpretable model. 621 01:16:29,400 --> 01:16:35,190 But rather, what we should do is take that as a lesson that perhaps there's something wrong with the assessment itself, 622 01:16:35,490 --> 01:16:39,150 whether that be in the administration or the substance of it. 623 01:16:39,360 --> 01:16:41,880 And so this is where I'm coming from here. Okay. 624 01:16:42,210 --> 01:16:49,170 We've had a million and one studies in the world about multilingual assessments, 625 01:16:49,170 --> 01:16:55,140 sorry, Heather, I'll make you cringe in saying multilingual, cross-lingual assessments, showing differential item functioning. 626 01:16:55,860 --> 01:17:04,010 But why? We need to predict and explain, so that we can actually take these signs of anomalies and improve the assessments going forward. 627 01:17:06,040 --> 01:17:09,040 So that's the general aim of this particular study. 628 01:17:09,070 --> 01:17:10,430 I'm going to have to hurry up a little bit. 629 01:17:10,450 --> 01:17:20,019 But essentially what we needed to do is basically test out, as a first stepping stone, an innovative approach, and I'd say, you know, 630 01:17:20,020 --> 01:17:24,190 arguably one of the best aspects of it is that it's scalable: 631 01:17:24,190 --> 01:17:28,690 it doesn't rely on you having to go back to humans and then making sense of things and whatnot. 632 01:17:28,690 --> 01:17:32,979 So if it works, it's a scalable approach to predicting language effects, 633 01:17:32,980 --> 01:17:40,600 i.e. these anomalies in item functioning, across different language versions of the high-stakes IB Diploma Programme examinations. 
634 01:17:40,600 --> 01:17:46,840 For those of you who don't know, the Diploma Programme is actually the end-of-secondary-schooling qualification for the IB. 635 01:17:46,870 --> 01:17:57,050 So it's the thing that determines whether you go to university or on to other relevant post-secondary pursuits. And, you know, 636 01:17:57,100 --> 01:18:03,909 we have been very fortunate to have the IB as a research funder and partner, because they have been open to 637 01:18:03,910 --> 01:18:10,899 us doing more of this sort of blue-sky research in the context of also doing practical things for them, 638 01:18:10,900 --> 01:18:14,320 like reviewing translation processes and what have you. 639 01:18:15,730 --> 01:18:21,940 And so what I proposed to do is to basically mash up psychometrics, 640 01:18:21,940 --> 01:18:26,589 machine learning and natural language processing to understand the extent to which the 641 01:18:26,590 --> 01:18:32,950 differences in item performance are related to differences in the textual complexity 642 01:18:32,950 --> 01:18:38,079 of the items across languages, and to determine which particular aspects of textual 643 01:18:38,080 --> 01:18:42,890 complexity are the most predictive of any cross-language differences that we find. 644 01:18:42,910 --> 01:18:47,110 So in order to understand the rest of the talk, you're just going to have to understand psychometrics, 645 01:18:47,110 --> 01:18:51,489 machine learning and natural language processing. Okay. No, just joking. 646 01:18:51,490 --> 01:18:57,610 Just follow the gist of it. In terms of what data we were dealing with: 647 01:18:57,610 --> 01:19:00,639 pretty complex stuff when you start breaking it all down. 648 01:19:00,640 --> 01:19:08,620 I must admit we went into the contract not quite realising just how complex what they were asking was, but there's another lesson for the future in that. 649 01:19:09,100 --> 01:19:12,159 So we're dealing with 90,000 students. 
650 01:19:12,160 --> 01:19:14,709 Most of them responded to the exam in English, 651 01:19:14,710 --> 01:19:22,380 but we have a fair proportion who responded to it in Spanish and a much smaller number who responded to it in French. 652 01:19:22,390 --> 01:19:27,280 What are we talking about here? We're talking about 12 exam papers: Biology, Chemistry and Physics. 653 01:19:27,550 --> 01:19:34,870 So three subjects across two calendar years, with each subject offered at both higher level and standard level. 654 01:19:35,110 --> 01:19:41,829 And they actually have three papers within each, but we just looked at the first two because Paper 3 involves optionality, 655 01:19:41,830 --> 01:19:48,820 which just creates even more complexity to deal with, Paper 1 being a multiple-choice exam and Paper 2 being constructed response. 656 01:19:49,150 --> 01:19:54,040 So all in all, we have 495 items in total across these different factors. 657 01:19:55,600 --> 01:19:59,139 So in order to build a predictive model, what do we have to do? 658 01:19:59,140 --> 01:20:05,110 We use the machine learning approach to identify optimal sets of features that predict the outcome variable, 659 01:20:05,110 --> 01:20:10,390 i.e. whether there was differential item functioning or not. I realise I haven't defined that yet, but I will in a minute. 660 01:20:10,900 --> 01:20:15,400 So, this is done using the ‘caret’ package in R. Now, this is just an incredible package. 661 01:20:15,400 --> 01:20:21,910 It still blows my mind with open-source software how people create this stuff and distribute it free of charge. 
662 01:20:22,270 --> 01:20:28,059 I recommend any student out there, even if you're just doing a basic linear regression or logistic regression, to look 663 01:20:28,060 --> 01:20:32,799 at this package, because it allows you to apply what's referred to as cross-validation, 664 01:20:32,800 --> 01:20:36,550 which I will get to in a second, which is so rarely done in our field. 665 01:20:36,580 --> 01:20:41,170 And, you know, the data scientist in the crowd is nodding his head. 666 01:20:41,170 --> 01:20:45,820 It's very necessary for us to start implementing that as standard practice in our field. 667 01:20:46,840 --> 01:20:51,100 And so to do this, it's first necessary to establish the outcome variable, 668 01:20:51,100 --> 01:20:56,079 which is the DIF estimates from the TAM package in R, as well as the predictor variables. 669 01:20:56,080 --> 01:21:01,210 And where we got those from was open-source software called READERBENCH. 670 01:21:02,380 --> 01:21:08,200 And also we included the test and item characteristics in there as predictor variables as well. 671 01:21:09,850 --> 01:21:18,220 So, detecting the DIF: basically, differential item functioning is when an item displays differences in difficulty 672 01:21:19,730 --> 01:21:24,470 across different groups that we would estimate as having the same underlying ability, 673 01:21:25,340 --> 01:21:30,440 which means that they have a different probability of getting a correct response for that item. 674 01:21:30,740 --> 01:21:35,270 So between groups, between, say, English responders and French responders, 675 01:21:35,270 --> 01:21:41,030 it may well be that there's, on average, a different level of attainment on a particular item. 676 01:21:41,660 --> 01:21:44,680 So DIF is different to that; it's not average performance. 
677 01:21:44,690 --> 01:21:51,680 It's: if we take, say, the French and English speakers who we believe have the same underlying ability level, 678 01:21:51,860 --> 01:21:55,100 is the item differentially difficult for them? 679 01:21:57,300 --> 01:22:04,140 And the particular model that we applied here is the random coefficients multinomial logit Rasch model. 680 01:22:04,830 --> 01:22:10,920 Now you don't need to understand what that is, other than to know that this is a parameterization of the Rasch model 681 01:22:11,150 --> 01:22:19,740 that's much more like a multilevel model. The reason why that matters is because DIF can be understood, in this parameterization of 682 01:22:19,740 --> 01:22:23,430 the model, as being an interaction between the item difficulty parameter 683 01:22:23,790 --> 01:22:27,540 and a group parameter that you can add into the model. 684 01:22:27,780 --> 01:22:34,290 In addition to that, you can also add in covariates. And this is important because so often when people evaluate DIF, 685 01:22:34,290 --> 01:22:39,090 it's just this group factor as one analysis followed by that group factor as another analysis. 686 01:22:39,420 --> 01:22:43,979 And the fact is that the language of the test interacts with all sorts of other different group factors. 687 01:22:43,980 --> 01:22:51,100 So, for example: are they responding in a language which just so happens to be their native language? 688 01:22:51,100 --> 01:22:53,530 Is it their language at home? Okay. 689 01:22:53,770 --> 01:23:02,200 That would be one key covariate that's potentially going to show up as language DIF when it's actually nothing to do with language at all. 690 01:23:02,740 --> 01:23:07,629 And so we didn't really have the sheer amount of data required to delve into that too far. 
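The logic of DIF in a Rasch-type model can be sketched in a few lines. This is only an illustration of the idea, not the TAM parameterization with covariates that the study actually used, and all names and values here are made up for the example:

```python
import math

def p_correct(theta, b, dif, focal):
    """Rasch probability of a correct response, with a language-DIF shift.

    theta: person ability; b: item difficulty for the reference group
    (e.g. English responders); dif: the item-by-language interaction;
    focal: 1 if the focal-language (e.g. French) version was sat, else 0.
    """
    return 1.0 / (1.0 + math.exp(-(theta - (b + dif * focal))))

# Two examinees with the SAME estimated ability: if dif != 0, the item
# is differentially difficult across the two language versions.
p_english = p_correct(theta=0.5, b=0.0, dif=0.6, focal=0)
p_french = p_correct(theta=0.5, b=0.0, dif=0.6, focal=1)
```

With `dif = 0.6` the French version is harder for an equally able examinee; that item-by-language shift, not any average difference in attainment, is what the study set out to predict from textual features.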
691 01:23:07,630 --> 01:23:15,490 But we did include some covariates like that, for example, whether the test language matches the home language, as well as gender, 692 01:23:15,640 --> 01:23:19,900 as well as a couple of other factors, to try to purify, so to speak, 693 01:23:20,080 --> 01:23:28,260 the DIF estimate that we had based on language. And so this gives us DIF as both a continuous, standardised outcome, 694 01:23:28,260 --> 01:23:35,340 but it can also be categorised into basically an effect-size measurement as well: negligible (small) DIF, moderate and large. 695 01:23:36,860 --> 01:23:43,610 And so I really don't want you to glaze over trying to read all of this, because I appreciate it's too much information, 696 01:23:43,610 --> 01:23:49,100 but from here on is the DIF that basically is considered to matter. 697 01:23:49,400 --> 01:23:53,870 And you can see here that, by and large, the percentages across the different subjects, 698 01:23:53,870 --> 01:23:59,300 across the different levels, across the different languages are pretty small. 699 01:23:59,510 --> 01:24:04,820 So this is a good news story for the IB, basically, that it seems they are doing well. 700 01:24:04,970 --> 01:24:08,930 And then I'll show you for 2019 as well; it's probably even better. 701 01:24:09,140 --> 01:24:14,150 So, by and large, it looks like they're doing a pretty good job of the translation. 702 01:24:14,450 --> 01:24:17,650 It's a bad news story for us, however, because, you know, 703 01:24:17,750 --> 01:24:23,440 machine learning wants lots of observations of different kinds of categories, and we don't have them here. 704 01:24:23,450 --> 01:24:32,810 So I'll come back to that at the end. So generally, the findings were very positive for the IB DP Science examinations. 
705 01:24:34,220 --> 01:24:39,260 And as I pointed out, unfortunately, because there are so few items showing moderate and large 706 01:24:39,330 --> 01:24:43,460 DIF, we do have this highly imbalanced dataset for the later class predictions. 707 01:24:45,810 --> 01:24:49,200 So, the second step is to establish textual complexity. 708 01:24:49,210 --> 01:24:52,620 So, we've got the outcome variable. Now we need the predictor variables. 709 01:24:53,010 --> 01:24:58,950 Now the way that we did that, as I said earlier, was using the textual analysis software READERBENCH. 710 01:24:59,280 --> 01:25:04,499 Any of you who know about NLP and the prediction of text complexity know that 711 01:25:04,500 --> 01:25:08,100 there's a strong overlap between READERBENCH and what's offered in Art 712 01:25:08,110 --> 01:25:10,530 Graesser and his colleagues' product, Coh-Metrix. 713 01:25:10,860 --> 01:25:17,280 It's just that READERBENCH has been generalised to more languages than Coh-Metrix has, and it's also completely open source as well. 714 01:25:17,280 --> 01:25:20,370 So it was helpful for us in that regard. 715 01:25:20,910 --> 01:25:24,989 And so I'm not going to go into too much of the technical detail here, 716 01:25:24,990 --> 01:25:30,930 but basically it pre-processes the text using fairly standard core NLP procedures: 717 01:25:31,200 --> 01:25:38,940 basically, tokenising the bits of the text, tagging the kinds of words, etc., fronted adverbials and whatnot, 718 01:25:39,450 --> 01:25:42,870 Michael Gove joke in there, and dependency parsing, 719 01:25:42,870 --> 01:25:50,040 which is to do with grammatical structures. And it also applies language-specific resources as well: 720 01:25:50,040 --> 01:25:56,909 language-specific corpora, the actual body of text that it's drawing upon, and the lexical ontologies and semantic models, 721 01:25:56,910 --> 01:26:00,810 which have to do with the sort of semantic modelling of relationships in the text as well. 
722 01:26:04,550 --> 01:26:11,480 So it provides you with a wide array of indices, over 300, for understanding textual complexity in multiple languages. 723 01:26:11,930 --> 01:26:16,100 And, you know, I think this is still an open empirical question, 724 01:26:16,100 --> 01:26:21,470 but at least they argue in their literature that it provides information that can be compared across languages. 725 01:26:22,130 --> 01:26:27,030 There are five categories in particular: surface, syntax, morphology, word and cohesion. 726 01:26:27,050 --> 01:26:31,910 I'm not going to go into defining all of those in great detail, but we can talk about it at the end, if you like. 727 01:26:33,380 --> 01:26:36,470 And this is how we got to our final set, basically. 728 01:26:36,590 --> 01:26:38,770 It's very standard practice in machine learning. 729 01:26:38,780 --> 01:26:44,510 You get rid of any indicators that don't have any variance, because they're not going to predict anything for you. 730 01:26:44,750 --> 01:26:50,420 We also got rid of any variables that showed more than a sort of trivial amount of missing data as well. 731 01:26:50,990 --> 01:26:55,100 And then we ended up with 151 indices at the end. 732 01:26:55,610 --> 01:26:59,020 So in terms of what the predictor variables are, if you can picture it: 733 01:26:59,030 --> 01:27:05,870 for each of the languages, we have text complexity indices for each of the items. 734 01:27:05,870 --> 01:27:08,300 So we have them in English, French and Spanish. 735 01:27:08,600 --> 01:27:14,810 Now we just did this in a kind of brute-force, slightly dumb way, where we just literally got all the text for a question, 736 01:27:15,290 --> 01:27:21,409 bundled it together and ran the analysis on it, which again is a point I'll come back to at the end. 737 01:27:21,410 --> 01:27:29,420 But it's a starting point. We have to start somewhere. 
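The filtering step just described, dropping zero-variance indices and indices with more than a trivial amount of missing data, can be sketched as below. This is an illustrative Python re-implementation, not the actual caret preprocessing used in the study, and the 5% missingness cut-off is an assumed value for the example:

```python
def filter_features(table, max_missing=0.05):
    """Keep only complexity indices that could actually predict something.

    table maps an index name to its list of values across items
    (None marks a missing value).
    """
    kept = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        if 1 - len(present) / len(values) > max_missing:
            continue  # more than a trivial amount of missing data
        if len(set(present)) <= 1:
            continue  # zero variance: identical for every item
        kept[name] = values
    return kept

indices = {
    "flat_index": [3.0, 3.0, 3.0, 3.0],     # no variance -> dropped
    "gappy_index": [1.2, None, None, 0.8],  # too much missing -> dropped
    "useful_index": [0.4, 1.1, 0.9, 2.3],   # kept
}
surviving = filter_features(indices)
```

The same screen, applied to the 300-plus READERBENCH indices, is the kind of step that leaves the final 151.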
And so the predictor variables then become the differences between the complexity indices, 738 01:27:29,420 --> 01:27:34,220 for English versus French, and for English versus Spanish. 739 01:27:34,450 --> 01:27:37,430 Okay. Because we have a DIF estimate for both of those pairings, 740 01:27:37,640 --> 01:27:41,990 English being the reference group and Spanish and French being the two focal groups. 741 01:27:43,390 --> 01:27:47,950 So this is just to show you that, across the different categories, 742 01:27:48,700 --> 01:27:54,310 the remaining indices still represented the various categories quite well. 743 01:27:56,740 --> 01:28:03,700 So, as I said before, a machine learning model is designed to identify optimal sets of features to predict an outcome variable. 744 01:28:04,300 --> 01:28:09,190 How do you do that? How do you build the model? It's based on the evaluation of different criteria: 745 01:28:09,250 --> 01:28:14,740 basically, whether there's lower prediction error or greater explanatory power. 746 01:28:15,220 --> 01:28:18,580 And in particular, I'd say what differentiates, say, 747 01:28:18,580 --> 01:28:25,899 machine learning or data science from just normal applications of statistics is that there's a lot of emphasis placed 748 01:28:25,900 --> 01:28:32,920 on cross-validation to avoid overfitting the data and to enhance the generalisability of the findings. 749 01:28:33,160 --> 01:28:39,430 And so if anyone ever tries to tell you that machine learning is something different to statistics, tell them they're lying, 750 01:28:39,430 --> 01:28:45,700 but just say, you know, there's a nuance in that: there's a much stronger emphasis on this topic of cross-validation. 751 01:28:46,690 --> 01:28:51,009 So how do we cross-validate things? Well, first and foremost, we have a training and test sample. 
752 01:28:51,010 --> 01:28:56,710 So we're going to build a model that makes a certain set of predictions, and that's going to be based on the training set of data. 753 01:28:56,830 --> 01:29:00,460 Then we get the final model, and then we apply it to a test set of data 754 01:29:00,550 --> 01:29:07,720 the model has never seen before, in order to get an unbiased view of whether the model performs well or not. 755 01:29:08,080 --> 01:29:11,980 And when we're training the data itself, we do something very similar, 756 01:29:12,070 --> 01:29:16,630 although a little bit more convoluted, by way of what's referred to as K-fold cross-validation: 757 01:29:17,950 --> 01:29:25,450 dividing the training data up into, in this case, five folds. Each time, you train it on four of the folds and test it on the fifth, 758 01:29:25,450 --> 01:29:30,160 right? You iterate that around, and then you get the final averages that give you the model performance. 759 01:29:32,160 --> 01:29:40,239 So. We were using those 151 indices, the different indices, plus the five categorical test characteristics. 760 01:29:40,240 --> 01:29:45,790 And we were looking at 990 standardised DIF estimates: 495 items across the two languages. 761 01:29:46,840 --> 01:29:49,950 What were we using? We used both random forest regression, 762 01:29:49,960 --> 01:29:57,930 so this was for when DIF was a continuous outcome, and also ordinal random forest regression for DIF as an ordinal category outcome. 763 01:29:57,940 --> 01:30:01,060 Remember what I said: the small, medium, large. 764 01:30:01,870 --> 01:30:09,090 Now I don't think I'm going to have time to teach you what random forest regression is, but it is actually a pretty cool idea, to be honest. 765 01:30:09,100 --> 01:30:12,190 It's referred to as an ensemble learning approach. 
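The K-fold scheme just described (five folds, train on four, hold out the fifth, rotate, average) can be sketched as follows. This is a minimal Python illustration of the resampling logic, not caret's actual implementation:

```python
def kfold_splits(n_items, k=5):
    """Yield (train, test) index lists: each fold is held out once
    while the model is fitted on the remaining k-1 folds."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, test

# Cross-validated performance is then the average of the metric
# (e.g. R-squared) over the k held-out folds.
splits = list(kfold_splits(n_items=10, k=5))
```

Because every item is held out exactly once, the averaged metric reflects performance on data the model was not fitted to, which is what guards against overfitting.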
766 01:30:12,190 --> 01:30:14,530 And the reason why is that what we have here, 767 01:30:14,530 --> 01:30:22,900 we have different decision trees, and these are set up such that they are necessarily going to be different solutions to the problem, 768 01:30:23,140 --> 01:30:26,620 which then essentially get averaged at the end to give you a prediction. 769 01:30:27,010 --> 01:30:33,160 The great thing about it: it's nonparametric, it's non-linear, it's able to model very complex interactions. 770 01:30:33,400 --> 01:30:39,430 It can easily deal with both categorical and continuous predictors, and it's robust against multicollinearity. 771 01:30:40,090 --> 01:30:43,180 The sorts of problems we routinely deal with in our world, 772 01:30:43,180 --> 01:30:49,230 especially when you've got 151 different text indices that are all only very subtly different from one another. 773 01:30:49,690 --> 01:30:54,130 And it's commonly used in machine learning circles, and it's somewhat interpretable. 774 01:30:57,300 --> 01:31:01,770 So I was going to teach you what a decision tree was, but let's see if this works. 775 01:31:02,310 --> 01:31:08,490 So essentially what you're doing is you're recursively partitioning the data in terms of the predictor 776 01:31:08,490 --> 01:31:15,300 variables so that you can purify the prediction in terms of the dependent variable as quickly as possible. 777 01:31:15,780 --> 01:31:21,460 So in this case, you've got two predictor variables, X and Y, right, at the top there. 778 01:31:21,480 --> 01:31:24,800 It's basically looking for the partition 779 01:31:24,810 --> 01:31:30,180 it can introduce where, at the next node, the groups are as purified as possible. 780 01:31:30,180 --> 01:31:36,600 So let's say we want to predict whether it's a cat or a dog based on two predictor variables you've got: 781 01:31:36,600 --> 01:31:40,559 does it have pointy ears, or is it white, and so on? 
782 01:31:40,560 --> 01:31:46,469 Basically, the algorithm will cycle through different levels of these predictor variables to essentially 783 01:31:46,470 --> 01:31:53,700 give you the initial split that will purify your beginning set of cats and dogs into something that's purer. 784 01:31:54,000 --> 01:31:57,930 So this one has more cats and this one has more dogs, and so on and so forth. 785 01:31:58,230 --> 01:32:05,670 And it stops doing that when it gets to a point where any further bifurcations don't add any information to the process. 786 01:32:07,690 --> 01:32:12,580 And so why random forest? Well, how many trees are you going to have? 787 01:32:12,610 --> 01:32:18,640 So we actually just fixed it to have a thousand trees. What it does then is that it creates new samples. 788 01:32:18,970 --> 01:32:24,610 So it takes the original sample and it creates what's referred to as a bootstrapped sample. 789 01:32:24,790 --> 01:32:28,300 So it randomly samples from the original sample with replacement. 790 01:32:28,540 --> 01:32:36,699 So the rows from the original sample might appear more than once in the bootstrap sample, and it will do that a thousand times with a thousand trees. 791 01:32:36,700 --> 01:32:44,980 So therefore every tree is unique. And then the second thing it does is it takes a different subset of predictor variables for each tree. 792 01:32:45,430 --> 01:32:47,200 And so how do you determine those numbers? 793 01:32:47,200 --> 01:32:53,530 Well, basically, again, you just brute-force it, and you say try every value and tell me which one works the best. 794 01:32:54,010 --> 01:33:00,750 So that's what we did. The results? As I said, they're not particularly compelling, but they're also not nothing. 795 01:33:00,760 --> 01:33:06,850 And to be honest, at the beginning of this research project, we were a little bit worried that this would just show up nothing. 
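The two ingredients just described, bootstrapped samples and purity-driven splits, can be sketched in a few lines. This is an illustrative toy in Python (using Gini impurity as one common purity measure), not the actual random forest implementation used in the study:

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows WITH replacement: some rows repeat, some are
    left out, so every one of the (here, a thousand) trees sees
    different data."""
    return [rng.choice(rows) for _ in rows]

def gini(labels):
    """Gini impurity: 0.0 for a perfectly 'pure' node (all cats or all dogs)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# A split is chosen because it purifies the node: a mixed set of cats
# and dogs at the top, purer groups below.
mixed = ["cat", "cat", "dog", "dog"]
pointy_ears = ["cat", "cat"]   # left child after splitting on pointy ears
floppy_ears = ["dog", "dog"]   # right child
```

The tree keeps splitting until further bifurcations no longer reduce impurity; the forest then averages the thousand trees' predictions.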
796 01:33:08,050 --> 01:33:15,880 And so you see here in the training set with the K-fold cross-validation, the R-squared ends up being about 0.12. 797 01:33:16,540 --> 01:33:22,730 And interestingly, when we then apply that to the test set, it's more like 0.20. 798 01:33:22,750 --> 01:33:30,790 So normally what you're looking for between training and test is that that number for test isn't too much lower than for the training. 799 01:33:30,790 --> 01:33:36,880 Something strange is going on here where the test performance is actually much better than the training performance. 800 01:33:37,240 --> 01:33:46,030 So somewhere between 12 and 20% of the variance in that continuous DIF outcome variable is being predicted by this model. 801 01:33:47,660 --> 01:33:51,530 Again, I don't want you to read all of this, other than to pay attention to the fact that, 802 01:33:51,530 --> 01:33:55,429 remember, the sort of basic characteristics of the items that were in the model, 803 01:33:55,430 --> 01:34:04,560 they don't appear here. All of these are NLP-based predictors. And then how does it determine whether something is important or not? 804 01:34:04,580 --> 01:34:12,530 Essentially, it pulls it out of all the models and says: how much of the explained variance do we lose based on pulling out that one predictor? 805 01:34:14,530 --> 01:34:18,249 In terms of predicting the ordinal outcome: 806 01:34:18,250 --> 01:34:26,980 again, not great, but not nothing. So we again have this improved performance of the model in the test data as opposed to the training data. 807 01:34:27,460 --> 01:34:33,640 Somewhere between 34 and 40% of the language DIF effect size categories were predicted. 808 01:34:33,640 --> 01:34:40,390 But remember, you would expect some successful predictions just based on chance alone, which is where the Kappa comes in. 809 01:34:40,780 --> 01:34:47,020 And by any sort of conventional standards, that's not a very good Kappa, but it's not nothing either.
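Two evaluation ideas come up in this passage: variable importance measured as the explained variance lost when one predictor is knocked out, and Cohen's Kappa as a chance-corrected hit rate for the ordinal outcome. A small sketch of both on synthetic data (not the study's data; scikit-learn's permutation importance shuffles a predictor rather than refitting without it, which is a common approximation of the same idea):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic continuous outcome driven by 2 of 5 predictors.
X, y = make_regression(n_samples=400, n_features=5, n_informative=2,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Importance = mean drop in R^2 on held-out data when one predictor's values
# are shuffled, breaking its link with the outcome.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(np.argsort(imp.importances_mean)[::-1])  # most important first

# Kappa: agreement between predicted and true ordinal categories,
# corrected for the agreement you would get by chance alone.
true_cats = [0, 0, 1, 1, 2, 2, 2, 1]
pred_cats = [0, 1, 1, 1, 2, 2, 0, 1]
print(round(cohen_kappa_score(true_cats, pred_cats), 2))  # → 0.62
```

In the toy Kappa example the raw agreement is 6/8 = 0.75, but chance agreement from the marginal category frequencies is about 0.34, so the chance-corrected value is noticeably lower, which is exactly why raw accuracy overstates the ordinal model's performance.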
810 01:34:49,820 --> 01:34:54,980 Again, these are all NLP variables that are doing the heavy lifting in the modelling. 811 01:34:56,340 --> 01:35:02,999 So the models perform somewhat poorly, but they still predicted something, especially for the continuous DIF outcome. 812 01:35:03,000 --> 01:35:12,360 And I want you to keep in mind that when we got the expert reviewers, right, based on the cApStAn model of translation quality control, 813 01:35:12,360 --> 01:35:20,400 and we created a survey out of that and got them to compare the items, I think on six different criteria, it predicted exactly zilch. 814 01:35:20,690 --> 01:35:25,020 Okay. So relative to that, I think this is doing okay. 815 01:35:25,740 --> 01:35:26,550 And also, 816 01:35:26,970 --> 01:35:35,370 you can see here that the predictors that were doing the heavy lifting do come from the automated natural language processing of the text. 817 01:35:37,200 --> 01:35:42,780 So, the future for this work. I personally think these findings show promise; you might disagree. 818 01:35:44,130 --> 01:35:50,020 What do I think we need to do to improve it? I think we need purer and larger estimates of DIF by language. 819 01:35:50,040 --> 01:35:56,069 What do I mean by purer? Basically, remember I said we tried to decontaminate 820 01:35:56,070 --> 01:35:59,550 the DIF estimates by way of having other covariates in the DIF model. 821 01:35:59,820 --> 01:36:06,930 Well, we could only do so much with that. So I think that we need to get other datasets which enable us to do that even more so. 822 01:36:06,940 --> 01:36:11,759 So for example, with international large-scale assessments, PISA, PIRLS, within a country, 823 01:36:11,760 --> 01:36:16,049 they'll have people doing the assessment in multiple languages.
824 01:36:16,050 --> 01:36:18,840 And it just enables you, once you build up a dataset like that, 825 01:36:18,840 --> 01:36:24,870 to partial out some of that variance attributable to other factors that contaminates the language DIF, so to speak. 826 01:36:25,560 --> 01:36:30,710 Also, is this ever really going to work that well for science? Well, science, you know, 827 01:36:31,380 --> 01:36:36,390 particularly when we're talking about physics at what we'd call a Year 12 level in Australia, 828 01:36:36,630 --> 01:36:44,760 there's a lot of mathematics and formulae in there. The same goes for chemistry, and biology is a little bit more language-heavy, but still. 829 01:36:45,330 --> 01:36:51,330 So probably this wasn't the best starting place to apply this; something like a reading assessment would probably have worked better. 830 01:36:51,870 --> 01:37:00,060 And also, coming back to this idea of us just sort of jumbling all the text together for each item and running the NLP analysis on that: 831 01:37:00,360 --> 01:37:07,530 it's not a very intelligent approach. We need a more nuanced approach to evaluating text complexity by item type. 832 01:37:07,890 --> 01:37:14,400 When you think about it, a multiple-choice item is a very, very complex linguistic entity. 833 01:37:14,970 --> 01:37:24,180 Okay? And just the sheer amount of text and its qualities only go so much of the way to explaining the complexity of the item, 834 01:37:24,390 --> 01:37:26,730 and the same goes for constructed response items. 835 01:37:28,050 --> 01:37:34,650 And as a final point, I come back to the sort of general premise of the whole research and whether it's valid or not. 836 01:37:34,650 --> 01:37:42,180 And I'm sure that some people in the room have ideas on this. So, is automatic cross-lingual text complexity assessment even possible? 837 01:37:42,810 --> 01:37:48,090 So we need to build models of text complexity on common corpora, rather than what ReaderBench does right now.
838 01:37:48,270 --> 01:37:53,850 Each of the languages is basing itself on a different corpus, which I don't think is very good. 839 01:37:55,050 --> 01:38:01,090 We need to build models of text complexity using new language-neutral tools for tagging and parsing, because, you know, 840 01:38:01,140 --> 01:38:07,590 as Art alluded to, the research world of NLP is making advances, like, every week. 841 01:38:07,600 --> 01:38:12,330 So, keeping up to date on what's happening there, 842 01:38:12,690 --> 01:38:18,890 validating the comparability of the complexity categories; these need to be psychometrically evaluated. 843 01:38:18,900 --> 01:38:23,850 So, for me, when I saw this, it screamed out to me: this is a factor analysis problem. 844 01:38:24,360 --> 01:38:32,519 The idea is that you have multiple indicators of these categories, that there are sort of latent categories underlying what the indices are about. 845 01:38:32,520 --> 01:38:37,590 And once you approach it that way, it also immediately leads to questions of, well, 846 01:38:37,740 --> 01:38:42,870 if you get a certain factor analysis solution for one language, do you get it across other languages? 847 01:38:42,870 --> 01:38:49,950 And you can actually start answering the question of whether these indices do work in an invariant way across languages. 848 01:38:50,820 --> 01:38:55,590 I'd say generally it's more plausible within major language families than across. 849 01:38:55,590 --> 01:39:04,110 So, you know, within Indo-European languages it might work okay, but not necessarily when you bring in other major language families. 850 01:39:04,440 --> 01:39:09,000 And generally it just remains a very interesting empirical question for us.
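The invariance question at the end can be sketched in a simple form: fit a factor model to the same text-complexity indices in each language group and compare the loading patterns. A proper test would use multi-group confirmatory factor analysis; the exploratory version below, on simulated indices with hypothetical language labels, is only meant to illustrate the logic:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

def simulate_indices(n, loadings):
    """Six text-complexity indices driven by one latent complexity factor."""
    factor = rng.normal(size=(n, 1))
    noise = 0.3 * rng.normal(size=(n, len(loadings)))
    return factor @ loadings[None, :] + noise

loadings = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
lang_a = simulate_indices(500, loadings)  # stand-in for language A items
lang_b = simulate_indices(500, loadings)  # stand-in for language B items

# One-factor solution per language; compare the two loading vectors.
def fit_loadings(X):
    return FactorAnalysis(n_components=1, random_state=0).fit(X).components_.ravel()

la, lb = fit_loadings(lang_a), fit_loadings(lang_b)
lb = lb * np.sign(la @ lb)  # loading signs are arbitrary; align them first

# Tucker congruence coefficient: near 1.0 suggests the indices pattern
# onto the latent complexity factor the same way in both languages.
congruence = la @ lb / (np.linalg.norm(la) * np.linalg.norm(lb))
print(round(congruence, 2))
```

Here the two groups are generated from the same loadings, so the congruence comes out near 1.0; with real indices across distant language families, a noticeably lower value would be evidence against invariance.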