Introduction
The purpose of this article is to present our approach to the problem of
automatically grading students' short answers, given a handful of graded
example answers for a particular question. The additional requirements of our
problem setting are:
- the model must be very computationally efficient
  - the number of supported students can be high and the funding per student very low
  - the model should be able to perform live regrades in a matter of milliseconds for each student
- the model must be resistant to adversarial attacks
  - inputting gibberish or a completely irrelevant answer should never result in a grade other than 0
- the model must be able to support questions from various branches of science and liberal arts, in multiple languages
  - it should work for any subject taught in school
  - it should support at least the major world languages
We propose to solve the problem by using existing multilingual language
embedding models, which convert questions and answers into fixed-length
real-valued vectors, to which we apply online regression methods.
To make the model additionally resistant to adversarial inputs, we build a
separate relevancy model whose purpose is to determine whether the answer is
relevant to the question (even if it is incomplete or wrong, does it at least
pertain to the subject).
In this work we propose to use an existing pre-trained, transformer-based,
multilingual model to generate embedding vectors -- real-valued, fixed-length
vectors -- representing every answer, and to use them to train
- a grade classification model outputting a probability for each numerical
grade (from 0 to N)
- a relevancy classification model outputting the probability that the answer is
relevant (on topic, even if wrong or incomplete) or irrelevant (in the wrong
language or completely inapplicable)
Short Overview of Practical NLP
The field of natural language processing (NLP) has a long history and includes
many completely different approaches, from explicit text structure discovery to
deep learning methods, most recently (around 2017) involving the transformer
model (https://arxiv.org/abs/1706.03762).
We will focus on the recent deep-learning approaches and will use a
transformer-based embedding model. In this paradigm, the text to be embedded
(converted to a real-valued vector) needs to (1) first pass through a tokenizer,
getting converted to a list of indices representing words or word fragments from
a vocabulary known to the model, and (2) then the token index list is passed
once forward through the transformer-based model and averaged into a
fixed-length real-valued vector.
Thus, we have two components: (1) a tokenizer, which converts text to integer
indices, and (2) a model, which converts those integers into a (hopefully
meaningful) real-valued vector representing the text. Tokenizers and models are
usually bundled together since (i) a model trained with one tokenizer will only
work well with a tokenizer that is identical or differs by only a very small
number of words, and (ii) that tokenizer was shown to result in a good model --
assuming the pre-trained model has been verified to work well.
The Tokenizer
The purpose of a tokenizer is to convert text into a sequence of numbers that a
language model can ingest. A tokenizer typically has a fixed vocabulary in which
each word (or word fragment) corresponds to a unique integer index. The
tokenizer then converts each word in the text into the corresponding integer
index. To avoid an excessively large vocabulary, tokenizers sometimes split
longer words into smaller segments that are shared between words. These short
words and common fragments are collectively known as tokens. For English, a
text of N words will usually be converted into roughly 1.33 N tokens.
Apart from words, the vocabulary of a tokenizer can also contain special tokens
like end of text indicators, end of sentence indicators, unknown word
indicators, etc. An unknown word is either split into existing tokens (if it can
be decomposed as such) or replaced with the unknown word token.
A trained language model will typically have a specific tokenizer associated
with it, as changing which integer corresponds to which token will result in
the model receiving unrelated inputs.
Fig: A tokenizer converts words into numbers, but sometimes splits longer (or
unknown) words into shorter segments to both avoid an exploding size of the
vocabulary and to allow the model to generalize to new words.
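As a minimal illustration (assuming the Hugging Face transformers library and
the bert-base-multilingual-cased checkpoint we use later; any other tokenizer
behaves similarly), tokenization looks roughly like this:

```python
# A minimal tokenization sketch; the checkpoint name is an assumption for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "The discriminant determines the number of real roots."
tokens = tokenizer.tokenize(text)  # subword strings; rare words are split into '##'-prefixed pieces
ids = tokenizer.encode(text)       # integer indices, including special [CLS]/[SEP] tokens

print(tokens)
print(ids)
```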
The Embedding Model
An embedding model is simply a language model that converts the tokenized text
into a fixed-length real-valued vector for further downstream tasks. Many
language applications start with a pre-trained embedding model and append
another, task-specific model that works on the embeddings produced by the
embedding model. Typically, embedding models are trained in such a way that
their embedding output produces good results when used with a variety of
downstream models. Some applications also fine-tune the embedding model, rather
than only training the downstream model, to achieve better results, but that
sometimes risks reducing the generalization properties of the embedding model
(since the embedding model is often trained on a very diverse dataset for the
express purpose of achieving good generalization).
Fig: A pre-trained embedding model followed by a downstream model is a common
approach to solving natural language processing tasks.
Modern embedding models often use the transformer architecture. Because most
transformers are designed to work on real-valued vector representations of words,
the integer output of the tokenizer is often first passed through an embedding
layer, which simply maps each integer index to a randomly initialized (but
learnable) real-valued vector.
Training and inference are often done more efficiently in batches, but most
transformer architectures require that all inputs in a batch have the same
length. Most tokenizers therefore pad all but the longest input in a batch with
special pad tokens, which are supposed to not affect that batch entry's
embedding value. A text fragment (a batch entry) with and without pad tokens
should produce the same embedding value (which is usually the case, except when
the embedding model uses very low precision arithmetic, like int8, where small
errors tend to accumulate). Additionally, most embedding models are designed
with a maximum input length, and the associated tokenizer will often truncate
longer texts by simply cutting off everything after the maximum number of
tokens.
The embedding model is thus simply a map from a variable-length text to a
fixed-length real-valued vector,
$$E : \text{text} \mapsto v \in \mathbb{R}^{768},$$
where 768 is an example dimension dependent on the model architecture. For
transformer models, the embedding is often simply a sum (or average) of the
final embedding of each token in the text.
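A hedged sketch of computing such an embedding with a pre-trained transformer
(masked mean pooling over the final hidden states is one common choice; the
checkpoint name is an assumption):

```python
# Sketch: tokenize a batch of answers and mean-pool the final hidden states into
# fixed-length vectors. Padding positions are masked out so they do not affect the result.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"           # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

answers = ["There are three cases, depending on the discriminant.",
           "Two distinct real roots, one repeated root, or two complex roots."]
batch = tokenizer(answers, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, tokens, 768)
mask = batch["attention_mask"].unsqueeze(-1)    # (batch, tokens, 1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)   # (batch, 768)
```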
The Problem: Automatically Grading 100s of Assignments
The problem described in this article is to develop a system that can
automatically grade short-answer problems -- problems for which the student is
required to write an answer in the form of a short text.
The system is designed to generalize from a small handful of teacher-graded
answers (circa 20-30) to automatically grade the remaining and future answers,
as well as to grade student answers in real time (below 1 second).
Estimating Relevance
Because the model is available to the students in real time, we want to avoid
the possibility that a student keeps inserting irrelevant text into the answer
box until the answer gets a high grade. Some examples could include:
- an empty answer
- a copy of the question
- an expertly written answer from the same domain, but not relevant to the question
For that reason we attempt to estimate relevance as the probability that an
answer is relevant to the question. Importantly:
- an incorrect answer on topic is still relevant
- an incomplete answer is still relevant
The Relevance Model
The relevance model is a simple downstream binary classifier built on top
of the embedding model. The model is trained to predict whether an answer is
relevant to the question or not. The architecture is a simple multi-layer
perceptron (MLP) trained using gradient descent, with validation-set
hyperparameter tuning (batch size, learning rate, dropout, optimizer) using
ray.tune.
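A minimal sketch of such a classifier and its search space (the layer sizes and
ranges are illustrative assumptions; ray.tune's tune.run/tune.report interface
is one way to drive the search):

```python
# Sketch of a binary relevance MLP over concatenated question/answer embeddings,
# with a hyperparameter search space for ray.tune. Sizes and ranges are assumptions.
import torch.nn as nn
from ray import tune

class RelevanceMLP(nn.Module):
    def __init__(self, in_dim=2 * 768, hidden=256, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, 2),      # two classes: relevant / irrelevant
        )

    def forward(self, x):
        return self.net(x)

search_space = {
    "batch_size": tune.choice([32, 64, 128]),
    "lr": tune.loguniform(1e-5, 1e-2),
    "dropout": tune.uniform(0.0, 0.5),
    "optimizer": tune.choice(["adam", "sgd"]),
}
# tune.run(train_relevance, config=search_space, num_samples=50)
# where train_relevance(config) trains the MLP and reports validation loss via tune.report(...)
```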
The training dataset is a combination of several instruction datasets (e.g.,
RedPajama) popularized by large language model research. Negative examples are
produced by randomly pairing questions (instructions) with responses from other
examples.
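A sketch of how such negative pairs could be constructed (the field names are
assumptions about the dataset format):

```python
# Sketch: positives are real (instruction, response) pairs; negatives pair each
# instruction with a response drawn from a different, random example.
import random

def make_relevance_pairs(dataset, seed=0):
    """dataset: list of dicts with 'instruction' and 'response' keys (assumed format,
    assumed to contain more than one example)."""
    rng = random.Random(seed)
    pairs = []
    for ex in dataset:
        pairs.append((ex["instruction"], ex["response"], 1))      # relevant
        other = rng.choice(dataset)
        while other is ex:
            other = rng.choice(dataset)
        pairs.append((ex["instruction"], other["response"], 0))   # irrelevant
    return pairs
```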
Using a Pre-trained Model
We use the multilingual BERT model for its excellent balance of generalization
capability and computational efficiency. The model weights (expressed in
float32 arithmetic) take only around 600 MB.
Meta-learning the Model
The meta-learning problem is to improve the result of an adaptation process,
$$\theta^* = \mathrm{Adapt}(\phi, D),$$
where $\theta^*$ is the model target (the adapted parameters) and $D$ is the
data, by optimizing the meta-parameters $\phi$.
In other words, it's a standard optimization problem where the model
parameters $\theta^*$ are obtained from some adaptation process. The standard examples are:
- hyperparameter optimization - $\phi$ are the hyperparameters, and the process is the full training of the ML model
- meta-learning classification - $\phi$ are the initial model weights and the process is a few gradient descent steps
- optimal controller tuning - $\phi$ is the cost function of a controller (e.g.,
the cost matrices in the Linear Quadratic Regulator (LQR) problem), and the process is the solution to the LQR problem (the Riccati
equation solution)
- rational intent inference - $\phi$ is the set of weights a rational agent
assigns to rewards in the world, and the process is the result
of rational planning
- online learning - $\phi$ parametrizes, e.g., how aggressively we should learn from
examples observed online, and the process is adapting our model
online from new observations
Our case: online learning of assignment grading
In our case, we use meta-learning for online learning. The problem is learning a
grading model from graded assignment examples: what grade should a particular
student response receive.
We attempt to use a fully data-driven approach. We do not perform any explicit
analysis of the text; instead, we learn a numerical model that maps from the
text of the question and the student response to a grade. For each question, we
use a handful of graded responses (e.g., 20-30) and a semantic language
embedding model to convert the question and responses to fixed-length
floating-point vectors. We then fit a numerical classification model from the
embedding vectors to the grade, as sketched below.
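A sketch of this per-question setup (how the question and response embeddings
are combined is an assumption; here they are simply concatenated, and `embed`
stands for the embedding model from the sketches above):

```python
# Sketch: build the small per-question dataset from ~20-30 graded responses.
# `embed(text) -> 768-dim vector` is assumed to be the embedding model shown earlier.
import numpy as np

def build_dataset(question, graded_responses, embed):
    """graded_responses: list of (response_text, grade) pairs from the teacher."""
    q_vec = embed(question)
    X = np.stack([np.concatenate([q_vec, embed(resp)]) for resp, _ in graded_responses])
    y = np.array([grade for _, grade in graded_responses])
    return X, y   # inputs to the grade classifier fitted below
```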
Usual meta-learning methods
As a quick detour, let's consider the two simplest ways to implement this online
meta-learning adaptation problem:
- MAML-style meta-learning
  - adapt your model from a base model by taking a few steps of gradient
  descent, then differentiate through this adaptation process using automatic
  differentiation software
  - MAML is the simplest form of gradient-based meta-learning
- least-squares adaptation
  - adapt a linear model of the form $y = \theta^\top x$, where $\theta$ is a
  vector of weights and $y$ is a scalar
  - in most cases, $x$ is not the data directly (like the text of the student
  response), but an embedding produced by a deep learning model
  - the adaptation process is minimizing the squared difference between the
  observations and the model predictions,
  $$\theta^* = \arg\min_\theta \sum_i \left(y_i - \theta^\top x_i\right)^2,$$
  which has the closed-form solution $\theta^* = (X^\top X)^{-1} X^\top y$
  - because the last operation is a simple linear algebra operation (a matrix
  solve or matrix inverse operation), the adaptation process can be
  differentiated through using automatic differentiation software (see the
  sketch after this list)
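A minimal sketch of this differentiable least-squares adaptation (the setup is
assumed: x are embeddings, y are scalar grades, and a small ridge term keeps the
solve well-posed):

```python
# The closed-form least-squares solve is a plain linear-algebra operation,
# so gradients flow through it with ordinary automatic differentiation.
import torch

torch.manual_seed(0)
n, d = 25, 16
X = torch.randn(n, d, requires_grad=True)   # stand-in for (learnable) answer embeddings
y = torch.randn(n)                          # stand-in for scalar grades

ridge = 1e-3 * torch.eye(d)                 # small regularizer keeps the solve well-posed
theta = torch.linalg.solve(X.T @ X + ridge, X.T @ y)   # closed-form least-squares solution

eval_loss = ((X @ theta - y) ** 2).mean()
eval_loss.backward()                        # gradients w.r.t. X flow through the solve
```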
Problems with the usual meta-learning methods
There are, however, several problems with these two approaches:
- MAML tends to require large networks to work well, which leads to high
latency and a high energy cost of running the model
- MAML tends to be fairly data-inefficient in training (several steps of
gradient descent are, to begin with, a very small adaptation); this means long
and expensive training
- least-squares adaptation works very well for linear regression models, where
the task is predicting a scalar, but is of limited value for classification;
classification both (a) tends to work better for grade prediction and
(b) gives a richer signal, e.g., being able to indicate that an answer is either
completely wrong or completely right by assigning high probability both to a
score of 0 and to a score of 10, instead of predicting a 5
Our approach: online logistic regression
For these reasons, a much better approach is to use a linear model for logistic
classification. The linear model maps the (potentially embedded) data $x$ to a
vector with one entry per grade (e.g., 11 classes if the grade varies between 0
and 10 inclusive), and the objective to minimize is the logistic loss
$$L(\theta) = -\sum_i \sum_{k=0}^{N} y_{ik} \log p_{ik},$$
where
$$p_{ik} = \mathrm{softmax}(\theta^\top x_i)_k = \frac{\exp(\theta_k^\top x_i)}{\sum_{j=0}^{N} \exp(\theta_j^\top x_i)}$$
and
$$y_{ik} = \begin{cases} 1 & \text{if example } i \text{ has grade } k \\ 0 & \text{otherwise.} \end{cases}$$
This is a very common loss function for classification problems, referred to as
cross-entropy or multiclass logistic loss. $y_{ik}$ acts as an indicator
function, selecting only the softmax output ($p_{ik}$) of the correct class.
However, since the softmax is normalized over all output dimensions, the model
is encouraged to both increase the probability ($p_{ik}$) of the correct class
and decrease the probabilities of the other classes ($p_{ij}$, $j \neq k$).
Solution to the online logistic regression problem
This loss function is in fact a convex function, implying it has a single global
minimum. However, there is no closed-form solution; the optimization is solved
iteratively using Newton's method.
For optimality, we need the gradient of the loss function with respect to the
model parameters to be zero:
$$\nabla_\theta L(\theta^*) = 0.$$
We can iteratively apply Newton's method to find that zero-gradient point:
$$\theta_{t+1} = \theta_t - \left(\nabla^2_\theta L(\theta_t)\right)^{-1} \nabla_\theta L(\theta_t).$$
Because the loss function is convex, this iterative process is guaranteed to
converge to the global minimum, and in practice it does so in a few steps. A
minimal sketch follows.
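The sketch below uses synthetic stand-in data and a small ridge term to keep the
Hessian invertible; the dimensions are illustrative assumptions:

```python
# Newton's method for multiclass logistic regression on embedding vectors.
# X, y are synthetic stand-ins for answer embeddings and teacher-assigned grades.
import torch

torch.manual_seed(0)
d, K, n = 16, 11, 25                      # embedding size, grades 0..10, graded examples
X = torch.randn(n, d)
y = torch.randint(0, K, (n,))

def loss(theta_flat):
    logits = X @ theta_flat.reshape(d, K)
    return torch.nn.functional.cross_entropy(logits, y)

theta = torch.zeros(d * K)
for step in range(10):
    theta = theta.detach().requires_grad_(True)
    g = torch.autograd.grad(loss(theta), theta)[0]
    if g.norm() < 1e-6:                   # zero gradient -> optimum reached
        break
    H = torch.autograd.functional.hessian(loss, theta)
    theta = theta - torch.linalg.solve(H + 1e-4 * torch.eye(d * K), g)
```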
Differentiating through the solution for meta-learning
The easiest way of differentiating through an iterative optimization process
(like Newton's method in our case) is to simply keep track of the computation
graph of the whole optimization and exploit automatic differentiation software
to obtain gradients through the adaptation process.
However, for logistic regression the gradient and Hessian terms inside each
Newton iteration are themselves most easily computed with automatic
differentiation, which implies many nested levels of differentiation when we
then attempt to differentiate through the adaptation process as a whole. This is
mostly a practical problem: doing so results in large memory usage (having to
store the entire optimization history) and can be computationally slow
(depending on the level of computation-graph optimization provided by the
automatic differentiation software used). Instead, we propose to use the
implicit function theorem, which lets us differentiate through this adaptation
process with ease.
The gradients we need to compute
In order to minimize the meta-learning objective
$$\min_\phi L_{\mathrm{eval}}\left(\theta^*(\phi)\right),$$
we need to be able to compute the gradient $\frac{d L_{\mathrm{eval}}}{d\phi}$;
however, the eval loss really depends on $\theta^*$, which itself depends on
$\phi$, so we really have
$$\frac{d L_{\mathrm{eval}}}{d\phi} = \frac{\partial L_{\mathrm{eval}}}{\partial \theta^*} \frac{\partial \theta^*}{\partial \phi}.$$
The gradient $\frac{\partial L_{\mathrm{eval}}}{\partial \theta^*}$ can be obtained easily with any
automatic differentiation software. The gradient $\frac{\partial \theta^*}{\partial \phi}$ can be
computed using the implicit function theorem.
The implicit function theorem technique
The implicit function theorem gives a way of obtaining derivatives with respect
to a variable defined as an implicit function:
$$F(\theta, \phi) = 0.$$
Here, $\theta$ is defined as satisfying the implicit equation $F(\theta, \phi) = 0$;
its value is not given explicitly. The implicit function theorem states that if
$F$ is further parameterized by a parameter $\phi$, we can obtain derivatives of
$\theta$ with respect to $\phi$:
$$\frac{\partial \theta}{\partial \phi} = -\left(\frac{\partial F}{\partial \theta}\right)^{-1} \frac{\partial F}{\partial \phi}.$$
Applying the implicit function theorem to the meta-learning problem
Returning to our notation, we know that $\theta^*$, the parameters adapted by
Newton's method, are optimal, so they must satisfy $\nabla_\theta L_{\mathrm{train}}(\theta^*, \phi) = 0$. Setting
$$F(\theta^*, \phi) = \nabla_\theta L_{\mathrm{train}}(\theta^*, \phi) = 0$$
defines an implicit function for $\theta^*$. We can then apply the implicit
function theorem directly:
$$\frac{\partial \theta^*}{\partial \phi} = -\left(\nabla^2_\theta L_{\mathrm{train}}(\theta^*, \phi)\right)^{-1} \frac{\partial^2 L_{\mathrm{train}}(\theta^*, \phi)}{\partial \phi \, \partial \theta}.$$
This has two main advantages:
- memory usage: it does not rely on storing the Newton's-method optimization history
  - which yields memory-usage improvements
- efficient computation: it only requires two levels of differentiation, which
is much easier for most automatic differentiation software to compute
efficiently
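A sketch of the resulting meta-gradient computation (the meta-parameter $\phi$
here is a hypothetical per-dimension rescaling of the embeddings, chosen only to
make the example concrete; $\theta^*$ is assumed to come from the Newton solve
above):

```python
# Implicit-function-theorem meta-gradient for the adapted logistic-regression weights.
# phi (a per-dimension embedding rescaling) is a hypothetical meta-parameter for illustration.
import torch

torch.manual_seed(0)
d, K = 16, 11
X_tr, y_tr = torch.randn(20, d), torch.randint(0, K, (20,))
X_ev, y_ev = torch.randn(5, d), torch.randint(0, K, (5,))
phi = torch.ones(d, requires_grad=True)
eye = 1e-4 * torch.eye(d * K)                                  # small damping term

def train_loss(theta_flat, phi):
    return torch.nn.functional.cross_entropy((X_tr * phi) @ theta_flat.reshape(d, K), y_tr)

def eval_loss(theta_flat, phi):
    return torch.nn.functional.cross_entropy((X_ev * phi) @ theta_flat.reshape(d, K), y_ev)

# Adapt theta with a few Newton steps (as in the earlier sketch), treating phi as constant.
theta = torch.zeros(d * K)
for _ in range(5):
    theta = theta.detach().requires_grad_(True)
    g = torch.autograd.grad(train_loss(theta, phi.detach()), theta)[0]
    H = torch.autograd.functional.hessian(lambda t: train_loss(t, phi.detach()), theta)
    theta = theta - torch.linalg.solve(H + eye, g)

# Meta-gradient via the implicit function theorem:
# dL_eval/dphi = (direct term) - (d^2 L_train / dphi dtheta)^T H^{-1} dL_eval/dtheta
theta = theta.detach().requires_grad_(True)
dLe_dtheta, dLe_dphi = torch.autograd.grad(eval_loss(theta, phi), [theta, phi])
H = torch.autograd.functional.hessian(lambda t: train_loss(t, phi.detach()), theta)
v = torch.linalg.solve(H + eye, dLe_dtheta)
g_train = torch.autograd.grad(train_loss(theta, phi), theta, create_graph=True)[0]
mixed = torch.autograd.grad(g_train @ v, phi)[0]               # (d^2 L_train/dphi dtheta)^T v
meta_grad = dLe_dphi - mixed                                   # gradient used to update phi
```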
Adapting the Relevance Model
Separately, we can also adapt our relevance model. That model is non-linear, so
we cannot directly apply the meta-learning technique from the previous sections.
We instead adapt the relevance model online, for each question separately, using
gradient descent while observing the evaluation loss. We choose the number of
online gradient descent steps that leads to the lowest evaluation loss (we split
the small set of online ground-truth data into a training and an evaluation set
randomly).
This is a MAML-like procedure of applying gradient descent online. It is not
strictly meta-learning, since we do not specifically train the relevance model
to be good at this online adaptation, but the fact that we dynamically choose
the number of adaptation steps to minimize an evaluation loss improves the
results over a static adaptation. A sketch of the procedure follows.
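The sketch assumes `model` is the relevance MLP and `X`, `y` are the small
per-question labeled set; the split ratio and step budget are illustrative
assumptions:

```python
# Online adaptation of the relevance model with the step count chosen on a held-out split.
import copy
import torch

def adapt_relevance(model, X, y, max_steps=50, lr=1e-3, eval_fraction=0.3, seed=0):
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(X), generator=g)
    n_eval = max(1, int(eval_fraction * len(X)))
    ev, tr = perm[:n_eval], perm[n_eval:]

    opt = torch.optim.SGD(model.parameters(), lr=lr)
    best_state, best_eval = copy.deepcopy(model.state_dict()), float("inf")
    for step in range(max_steps):
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(X[tr]), y[tr]).backward()
        opt.step()
        with torch.no_grad():
            eval_loss = torch.nn.functional.cross_entropy(model(X[ev]), y[ev]).item()
        if eval_loss < best_eval:          # keep the weights from the best number of steps
            best_eval, best_state = eval_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```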
Alternative Approaches
One obvious alternative approach is using a large language model (LLM),
specifically its instruction interface (a chat mode, where the model responds
to instructions), to grade answers using graded examples. The benefit of this
approach is that it is much more flexible: grading can be based on example
grades or directly on a grading rubric, but the downside is the high
computational resources required to run the model.
The computational cost of large language models scales linearly or
super-linearly with the context size. For our problem setting, where we wish to
provide the model with example graded answers, each additional example answer
increases the computational cost and places a soft limit on the maximum number
of examples considered. In contrast, our approach uses a much smaller embedding
model, where the embedding cost depends strictly linearly on the number of
embedded answers, and both the online adaptation and classification costs are
virtually negligible in comparison to the embedding cost.
Data Generation
Data for training both the relevance model and the grading model can be obtained
from real responses, since labeling is relatively inexpensive:
- grades are already assigned to answers in the existing, manual grading scheme
- relevance labels are binary and are usually easy to assign
However, training both models can benefit from large amounts of synthetic
data. Recent advances in large language models have produced
instruction-following models that can generate natural-sounding text in a
variety of languages.
We can easily use such a model (in this case ChatGPT-3.5 from OpenAI) to
generate examples of relevant and irrelevant synthetic responses for training,
as in the sketch below.
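One hypothetical way to script this generation (assuming the pre-1.0 openai
Python package and its ChatCompletion interface, through which gpt-3.5-turbo was
accessed at the time; adjust to whatever client you use):

```python
# Sketch: ask an instruction-following model for synthetic relevant/irrelevant answers.
# Assumes the legacy (pre-1.0) `openai` package; the prompt mirrors the example below.
import openai

def generate_synthetic_answers(question, n_per_class=50):
    prompt = (
        f"Generate {n_per_class} relevant and {n_per_class} irrelevant answers to the "
        f"question: {question} Relevant answers should include correct, incorrect, and "
        "incomplete answers; irrelevant answers should be off-topic, nonsensical, or in "
        "a different language."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```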
Consider the example:
Instruction:
I want you to generate examples of relevant and irrelevant answers to the
question: How many cases are there for finding roots of a quadratic equation? I
want you to generate 50 relevant and 50 irrelevant answers. The relevant answers
should include both correct, incorrect, and also incomplete answers. The
irrelevant answers should look like they are designed to fool an automatic
grading system for that question. The irrelevant answers should be completely
off-topic and some of them should be nonsensical and/or in a different language.
Make both categories of answers include long and short answers.
ChatGPT:
Relevant Answers:
There are generally three cases for finding roots of a quadratic equation.
Two cases are considered when finding roots of a quadratic equation.
The number of cases for finding roots of a quadratic equation depends on the discriminant.
The cases for finding roots of a quadratic equation can vary depending on the coefficients.
One case involves finding two distinct real roots for a quadratic equation.
Incomplete answer: The cases for finding roots of a quadratic equation are determined by the...
Incorrect answer: Quadratic equations always have four cases for finding roots.
There can be two imaginary roots when solving a quadratic equation.
An incomplete answer: The cases for finding roots of a quadratic equation depend on the number...
One case involves finding one real root for a quadratic equation.
Long answer: The number of cases for finding roots of a quadratic equation can be determined by analyzing the discriminant, which is the expression b^2 - 4ac. If the discriminant is greater than zero, there are two distinct real roots. If the discriminant is equal to zero, there is one real root, which is repeated. And if the discriminant is less than zero, there are two complex roots, which are conjugate pairs of the form (a + bi) and (a - bi), where a and b are real numbers and i is the imaginary unit.
Incomplete answer: Quadratic equations have multiple cases for finding roots, such as...
Incorrect answer: The number of cases for finding roots of a quadratic equation is always two.
There are multiple cases for finding roots of a quadratic equation, including complex solutions.
Long answer: When solving a quadratic equation, there are three possible cases: having two distinct real roots, having one real root (with multiplicity), or having two complex roots.
One case involves finding no real roots for a quadratic equation.
Incorrect answer: There is only one case for finding roots of a quadratic equation.
The number of cases for finding roots of a quadratic equation depends on the coefficients and the degree of the equation.
Incomplete answer: The cases for finding roots of a quadratic equation can be classified into...
One case involves finding two equal real roots for a quadratic equation.
Long answer: The number of cases for finding roots of a quadratic equation can vary from zero to two, depending on the values of the coefficients a, b, and c in the equation.
Incorrect answer: Quadratic equations have different cases for finding roots based on their leading coefficient.
There are cases where a quadratic equation may have infinite solutions.
Incomplete answer: The cases for finding roots of a quadratic equation can be determined by analyzing the quadratic formula.
Two cases are considered when finding roots of a quadratic equation: real and complex.
Long answer: The number of cases for finding roots of a quadratic equation is determined by the nature of the discriminant. If the discriminant is positive, there are two distinct real roots. If it is zero, there is one real root. And if it is negative, there are two complex roots.
There are various scenarios for finding roots of a quadratic equation, such as...
Incomplete answer: The number of cases for finding roots of a quadratic equation depends on the values of the coefficients and the discriminant, which is calculated using the formula b^2 - 4ac.
Incorrect answer: Quadratic equations can have an infinite number of cases for finding roots.
The number of cases for finding roots of a quadratic equation can be determined by examining the sign of the discriminant.
Long answer: The cases for finding roots of a quadratic equation can be categorized as follows: two real and distinct roots when the discriminant is positive, one real repeated root when the discriminant is zero, and two complex conjugate roots when the discriminant is negative.
Incorrect answer: The number of cases for finding roots of a quadratic equation is determined by the order of the equation.
There are specific criteria to determine the cases for finding roots of a quadratic equation.
Incomplete answer: The cases for finding roots of a quadratic equation depend on various factors, including...
The number of cases for finding roots of a quadratic equation can be identified by analyzing the coefficients and the discriminant.
Long answer: Depending on the values of the coefficients a, b, and c in a quadratic equation, there can be three cases: two distinct real roots, one repeated real root, or two complex roots.
Incorrect answer: Quadratic equations have a fixed number of cases for finding roots, regardless of their coefficients.
The cases for finding roots of a quadratic equation can be determined by examining the leading coefficient.
Incomplete answer: The number of cases for finding roots of a quadratic equation is determined by...
One case involves finding two imaginary roots for a quadratic equation.
Long answer: The number of cases for finding roots of a quadratic equation depends on the value of the discriminant. If the discriminant is positive, there are two distinct real roots. If it is zero, there is one real root. And if it is negative, there are two complex roots.
Incorrect answer: Quadratic equations can have a variable number of cases for finding roots, depending on the degree of the equation.
The cases for finding roots of a quadratic equation can be classified based on the values of the coefficients.
Incomplete answer: The number of cases for finding roots of a quadratic equation can be determined by examining the coefficients and the discriminant, which is the part of the quadratic formula under the square root symbol.
There are specific patterns to identify the cases for finding roots of a quadratic equation.
Long answer: Depending on the coefficients and the value of the discriminant, a quadratic equation can have either two distinct real roots, one repeated real root, or two complex roots.
Incorrect answer: The number of cases for finding roots of a quadratic equation is always the same, regardless of the equation's form.
There are multiple possibilities for the cases when finding roots of a quadratic equation.
Incomplete answer: The cases for finding roots of a quadratic equation can be determined by analyzing the equation's factors.
The number of cases for finding roots of a quadratic equation can be determined by the nature of the discriminant.
Irrelevant Answers:
The quadratic equation is a type of dance move that originated in the 1920s.
¡Hola! No sé qué es una ecuación cuadrática.
The number of cases for finding roots of a quadratic equation is 42, the meaning of life, the universe, and everything.
Quadratic equations are like a box of chocolates; you never know how many cases you'll get.
Quadratic equations have zero cases for finding roots because they are too shy to show themselves.
Quadratic equations have exactly five cases for finding roots, just like the fingers on your hand.
The number of cases for finding roots of a quadratic equation depends on the price of tea in China.
Quadratic equations have a secret code, and only those with the right decoder ring can determine the number of cases for finding roots.
Sorry, I can't answer that. I'm busy counting sheep.
Quadratic equations are aliens from another dimension, and their roots cannot be found in our universe.
The cases for finding roots of a quadratic equation are locked away in the lost city of Atlantis.
Quadratic equations have a unique case for finding roots that involves time travel.
The number of cases for finding roots of a quadratic equation depends on the alignment of the stars.
Quadratic equations have a secret society that decides the number of cases for finding roots.
The cases for finding roots of a quadratic equation are written in a language only mathematicians understand.
The number of cases for finding roots of a quadratic equation is classified information.
Quadratic equations have an unlimited number of cases for finding roots because they are magical.
The cases for finding roots of a quadratic equation are stored on a floppy disk.
Quadratic equations have a secret handshake that reveals the number of cases for finding roots.
The number of cases for finding roots of a quadratic equation depends on the color of your socks.
Quadratic equations have a hidden treasure map that leads to the number of cases for finding roots.
The cases for finding roots of a quadratic equation are like a crossword puzzle waiting to be solved.
Quadratic equations have a special case for finding roots that involves talking to plants.
The number of cases for finding roots of a quadratic equation depends on the length of your hair.
Quadratic equations have a secret language that only cats can understand.
The cases for finding roots of a quadratic equation are stored in a parallel universe.
Quadratic equations have a secret dance routine that reveals the number of cases for finding roots.
The number of cases for finding roots of a quadratic equation depends on the phase of the moon.
Quadratic equations have a secret recipe that determines the number of cases for finding roots.
The cases for finding roots of a quadratic equation are hidden in a riddle that only the Sphinx can solve.
Quadratic equations have a secret door that opens to reveal the number of cases for finding roots.
The number of cases for finding roots of a quadratic equation depends on the position of the planets.
Quadratic equations have a secret codebook that reveals the number of cases for finding roots.
The cases for finding roots of a quadratic equation are written on the wings of butterflies.
Quadratic equations have a secret handshake that only mathematicians know.
The number of cases for finding roots of a quadratic equation depends on the color of your eyes.
Quadratic equations have a secret language that only dolphins can understand.
The cases for finding roots of a quadratic equation are hidden in a hieroglyphic puzzle.
Quadratic equations have a secret map that leads to the number of cases for finding roots.
The number of cases for finding roots of a quadratic equation depends on the alignment of the planets.
Quadratic equations have a secret society that determines the number of cases for finding roots.
The cases for finding roots of a quadratic equation are written in a lost ancient script.
Quadratic equations have a secret formula that reveals the number of cases for finding roots.
The number of cases for finding roots of a quadratic equation depends on the phase of the sun.
Quadratic equations have a secret chant that reveals the number of cases for finding roots.
The cases for finding roots of a quadratic equation are hidden in a crossword puzzle.
Quadratic equations have a secret key that unlocks the number of cases for finding roots.
The number of cases for finding roots of a quadratic equation depends on the song playing on the radio.
Quadratic equations have a secret message that reveals the number of cases for finding roots.
The cases for finding roots of a quadratic equation are stored in a secret vault.