Fine-Tuning Assessments for AI Integration
Leveraging AI use and preventing AI misuse: a best-of-both-worlds approach
Summary
- There is a trade-off in assessment between leveraging AI use to support learning and mitigating the risks of AI misuse to bypass learning
- As such, assessments can be AI-Blind, AI-Proof, AI-Poor… but also AI-Rich, when they manage to combine the best of both worlds
- An AI Scorecard is proposed to measure AI-Richness
- An AI Dashboard approach is proposed to guide the design of AI-Rich assessments by fine-tuning AI use for specific steps and tasks in the learning process
AI-Blind, AI-Proof, AI-Poor, and AI-Rich Assessments
AI Assessment Scales, such as the one analyzed in the previous article in this series, are student-facing tools. A mix between an AI policy and an AI literacy document, they take a given (often traditional) assessment and clarify for students the nature and extent of appropriate AI use in this context.
What we need, however, is a teacher-facing tool that helps educators redesign assessments for the AI Age.
With regard to AI integration, assessments can be categorized as:
AI-Blind Assessments: Traditional and “unsecured” (Liu and Bridgeman) forms of assessment that are well-adapted to leveraging AI’s potential, but ill-adapted to avoiding its pitfalls. This includes long-term research papers, where AI can serve scaffolding and engagement purposes, but where it is also hard to control and detect. More generally, this includes the type of assessments that AI Scales target as they try to retrofit AI use into standard educational procedures.
AI-Proof Assessments: Forms of assessment where AI use is easy to forbid, either because of their “secured” context (e.g., in-class tests) or because of the nature of the submission (e.g., non-digital art). These “Lane 1” assessments (Liu and Bridgeman) can help avoid the pitfalls of AI use, but they also make it harder to leverage the potential of this technology for learning. While some are traditional, such as closed-book exams, this category also encompasses many “alternative assessments” that used to be formative favorites, such as Socratic seminars. Importantly, “AI-proofness” is a moving line: as technology advances, assessments that are currently AI-proof might not remain so, for instance as real-time wearable AI use becomes undetectable.
AI-Poor Assessments: Clearly, leveraging AI’s potential for learning and preventing AI misuse present a trade-off (see diagram below). In this regard, AI-Poor assessments are a worst-of-both-worlds scenario, where AI is hard to use to enhance learning, but easy to use to bypass it.
AI-Rich Assessments: Types of assessments that combine high levels of AI integration, to support and enhance the acquisition and demonstration of learning, with high levels of confidence in the measurement of student competencies. Leveraging AI’s potential while mitigating its pitfalls, AI-Rich assessments adopt a best-of-both-worlds approach. This might sound like a lofty dream, but it is actually the aim of AI Assessment Design.
AI-Rich Assessments, a Best-of-Both-Worlds Approach
A “best of both worlds” approach is possible because “AI-Blind” and “AI-Proof” assessments are concepts, not objects. They are abstract, general definitions, and particular assessments can borrow from both at the same time. Notably, AI-Proof assessments do not necessarily forbid AI use: they make AI use easy to control, meaning that educators can allow or disallow it for specific purposes.
Notice how this differs from an AI Scale approach. What we are aiming for here is not a scale with a few fixed “levels” of AI use (I have explained the limitations of this construct in a previous article), but rather a dashboard with fine controls over AI use for specific aspects of an assessment. With this approach, it is possible to design assessments that optimize AI integration, leveraging its potential and mitigating its pitfalls.
What are the design features of an optimal assessment in the new AI age?
It should prevent AI misuse. AI pseudo-“detectors” are unreliable (and inequitable) because they attempt to catch AI misuse after the fact. By using them, we actually model for students the very poor practices that we are trying to prevent:
- We use AI as a shortcut crowding out good teaching and learning practices
- We use a tool that is biased and potentially harmful
- We display overreliance on an unreliable technology
- We relinquish responsibility when we should always keep humans in the loop
- We prevent the development of AI literacy and competencies
It should propose answers to the question: “Why might students misuse AI?” Reasons include:
- Students do not understand what AI uses are appropriate/inappropriate
- Students do not have the skills needed to use AI appropriately
- Students misuse AI in an attempt to meet their learning needs
- Students face a tempting risk-benefit trade-off
- Students look for shortcuts for lack of interest in the assessment
It should define clear learning objectives and describe them so that they are operationalized in terms of specific cognitive operations.
It should leverage AI to enhance students’ motivation to achieve these objectives.
It should enable students to leverage AI to achieve (or surpass) these learning objectives.
It should allow educators to determine with high levels of reliability whether students are able to achieve these objectives, i.e., to perform specific cognitive operations.
It should be an opportunity for students to develop AI literacy and competencies.
These design features are captured in the AI Assessment Scorecard below.
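For readers who think in code, here is a minimal sketch of how the Scorecard could be operationalized as a checklist. The criterion wording, the 0-2 scoring scale, and all names below are illustrative assumptions, not the actual Scorecard rubric:

```python
from dataclasses import dataclass, field

# Illustrative criteria paraphrasing the design features above; the exact
# wording and the 0-2 scale are assumptions, not the real Scorecard.
CRITERIA = [
    "Prevents AI misuse by design, not by after-the-fact detection",
    "Addresses why students might misuse AI",
    "Defines learning objectives as specific cognitive operations",
    "Leverages AI to enhance motivation",
    "Enables students to use AI to achieve or surpass the objectives",
    "Reliably measures whether students achieve the objectives",
    "Develops AI literacy and competencies",
]

@dataclass
class Scorecard:
    """Scores one assessment per criterion: 0 = absent, 1 = partial, 2 = met."""
    assessment: str
    scores: dict[str, int] = field(default_factory=dict)

    def rate(self, criterion: str, score: int) -> None:
        if criterion not in CRITERIA:
            raise ValueError(f"Unknown criterion: {criterion!r}")
        if score not in (0, 1, 2):
            raise ValueError("Score must be 0, 1, or 2")
        self.scores[criterion] = score

    @property
    def ai_richness(self) -> float:
        # Fraction of the maximum possible score; 1.0 is fully AI-Rich.
        return sum(self.scores.values()) / (2 * len(CRITERIA))
```

An AI-Blind research paper might score high on the leverage criteria but low on reliable measurement; an AI-Proof closed-book exam, the reverse. AI-Richness requires scoring high across the board.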
AI Assessment Fine-Tuning Dashboard
The AI Assessment Scorecard enables educators to measure how well existing assessments fit the new AI context and, more importantly, to make the necessary adaptations, especially as they design new assessments.
An AI Dashboard is a tool that helps teachers create assessments that score high on the Scorecard. Starting from the learning objectives and their operationalization, it determines:
- How AI can be leveraged to support the acquisition and demonstration of learning, including motivation and scaffolding
- How AI use can be controlled to ensure a reliable assessment of learning
To do so, the Dashboard clarifies:
- The specific competencies assessed (e.g., selecting relevant primary sources)
- The multiple, discrete steps and tasks involved in the development and demonstration of this competency (e.g., finding a source, recalling and applying a critical framework, expressing ideas…), and potentially others involved in other aspects of the assessment
- The appropriate degree of AI use for each step in this learning process
Rather than defining an overall appropriate “level” of AI use, the teacher can then fine-tune AI integration with dimmer switches for each relevant step and task, leveraging AI’s potential while mitigating its pitfalls.
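To make the dimmer-switch idea concrete, here is a minimal sketch of a Dashboard as a mapping from steps to AI-use settings. The four labels stand in for the 4-point scale referenced below, and the step names are borrowed from the examples above; both are illustrative assumptions:

```python
from enum import IntEnum

# Stand-in labels for a 4-point AI-use scale (illustrative assumptions;
# not the actual scale adapted from Leon Furze's AIAS referenced below).
class AIUse(IntEnum):
    NONE = 0           # no AI use for this step
    ASSISTED = 1       # AI supports the student's own work (e.g., feedback)
    COLLABORATIVE = 2  # AI co-produces; the student curates and revises
    FULL = 3           # AI may perform the step outright

# Each discrete step gets its own dimmer switch, independent of the others.
dashboard: dict[str, AIUse] = {
    "finding a source": AIUse.FULL,
    "selecting relevant primary sources": AIUse.NONE,  # the competency assessed
    "recalling and applying a critical framework": AIUse.ASSISTED,
    "expressing ideas": AIUse.COLLABORATIVE,
}

for step, level in dashboard.items():
    print(f"{step}: {level.name}")
```

Note that the settings are neither cumulative nor progressive across steps, which is precisely what a one-dimensional scale cannot express.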
To see what this might look like, let’s visualize how such a Dashboard differs from an AI Scale.
(1) and (2) are examples of AI Scales. In the first, students are only allowed to use AI to conduct research, while in the second they can also use it to assist their analysis. Turning these ideas into a draft and a final submission must be their own work, however.
Clearly, these scales try to regulate how much of the final product is AI-generated. For reasons explained in the previous article in this series, this is not the right approach.
Looking at (3), it becomes obvious that there is no reason for any particular sequence of steps in an AI Scale to be cumulative, or even progressive. As a teacher, I could very well want to let students leverage AI use for a particular step, but not for the preceding or following ones.
“Process” is the key word here: AI Scales are limited because they focus on the product and how much AI content it contains, while the right pedagogical approach is to focus on the learning process, and thus on when and how AI is used to develop and demonstrate understandings and skills.
Depending on the learning objectives, I might want to design and fine-tune this learning process in any number of ways that cannot be captured by a scale, which can only be one-dimensional by definition.
In this visualization of an AI Dashboard:
- Appropriate AI integration is defined for each step in the process (based on a 4-point scale I created last year from a friendly critique of Leon Furze’s original AIAS)
- Descriptors operationalize the specific competencies (including AI competencies) students are expected to demonstrate
- The responsibility is placed on the students to provide evidence of these competencies (rather than requiring that they disprove potential AI misuse)
- AI tools and uses are suggested both to support and enhance learning
Interestingly (although not surprisingly, given the previous article in this series), rotating this Dashboard 90° gives a familiar figure: a learning scale, or standards-based rubric, where students develop increasingly complex understandings and skills as they leverage and outgrow AI assistance for specific purposes.
In the partially completed example below, the teacher selected the expected level of performance — and thus also of AI assistance — at a point in time, i.e., for this particular assessment.
Conclusion
A fine-tuning approach to AI integration in assessments opens up the possibility of combining the “best of both worlds”: leveraging AI use to support learning and preventing AI misuse to bypass learning.
Even better, the necessary AI Assessment Dashboard:
- Goes beyond AI Scales by building on the standards-based rubrics aligned with best practices known as “learning scales”
- Integrates standards as steps in the learning and generative process, inviting unit-long authentic assessments, such as the ones I proposed a year ago
- Embeds AI competencies into the performance indicators, paving the way for the integration of AI literacy into the curriculum
This is very cool!