How we approach adaptivity in the practice modules – DeepSpectrum Lab

Why adaptivity matters here

Practice modules in social cognition, theory of mind, and emotion regulation work only when task difficulty is matched to the child's current ability. Vygotsky described this with the zone of proximal development; the instructional-psychology tradition has discussed the same phenomenon since Bloom (1984) under the heading of mastery learning. Both lines converge on the same empirical observation: practice substantially below current ability produces stagnation, and practice substantially above it produces withdrawal.

In our modules, adaptivity governs three dimensions: item selection (which exercise comes next), repetition scheduling (when an already mastered concept is reactivated), and complexity level (the depth at which a construct is operationalised). It does not govern the underlying content. The exercises are fixed and aligned with the constructs we describe in the article on the research foundations.

What adaptivity is not, in our design

Adaptivity is not a personality model of the child. We do not infer learning styles, personality dimensions, or stable attentional traits. The review by Pashler and colleagues (2008) found no adequate empirical support for the learning-styles hypothesis; later reviews (Kirschner 2017) confirmed the picture. We treat stable trait attributions from short interaction sequences as methodologically untenable.

It is also not a recommendation engine in the sense familiar from consumer software. There is no latent score predicting which exercise the child will enjoy or complete. Engagement is not the optimisation target. Were it one, the modules would systematically drift toward situations where they no longer train a construct but only feed a confirmation loop.

The psychometric backbone: Rasch model as a starting point

The formal basis of difficulty control is the Rasch model, the one-parameter special case of item response theory. Every exercise carries a difficulty parameter β, every child a latent ability estimate θ. The probability of a correct solution is modelled by the logistic function. At θ = β, the probability of solving the item is 0.5.

P(X = 1 \mid \theta, \beta) = \dfrac{1}{1 + \exp(\beta - \theta)}

Rasch model (1). Probability of a correct response as a function of ability θ and item difficulty β.

Logistic response curve of the Rasch model. Solution probability is 0.5 at θ = β; we steer items toward a target hit rate P* ≈ 0.75.
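Equation (1) can be evaluated directly. The following is a minimal sketch, not our production code; the function name `rasch_probability` and the example values are illustrative assumptions.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct response under the Rasch model, equation (1)."""
    return 1.0 / (1.0 + math.exp(beta - theta))


# Ability equal to difficulty gives an even chance of success.
p_equal = rasch_probability(theta=0.0, beta=0.0)

# An item one logit easier than the child's ability is solved more often.
p_easier = rasch_probability(theta=0.0, beta=-1.0)
```

Because only the difference θ − β enters the formula, shifting both values by the same amount leaves the probability unchanged.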

Choosing the Rasch model over the two-parameter model (with an additional discrimination parameter α) or the three-parameter model (with a guessing parameter γ) is methodologically motivated. Rasch permits sufficiency-based estimation, separability of person and item parameters (specific objectivity, Rasch 1960), and stable estimates with small samples. That is exactly what we need: small pilot groups and continuous model maintenance, not large-scale study data.

The ability estimate θ̂ is updated incrementally after each response. In practice this is a Bayes expected-a-posteriori (EAP) estimate with an informative age-group prior. That keeps the estimate robust against single unusual responses, which matters when the effective sample size per session is one.

\hat{\theta}_{\mathrm{EAP}} = \dfrac{\int \theta \cdot L(x \mid \theta) \cdot \pi(\theta)\, d\theta}{\int L(x \mid \theta) \cdot \pi(\theta)\, d\theta}

EAP estimator (4). Posterior expected value: L(x | θ) is the Rasch likelihood of the response vector x so far, π(θ) the age-group prior.

I_{\mathrm{total}}(\theta) = \sum_{i} P_i(\theta) \cdot \bigl(1 - P_i(\theta)\bigr)

Test information (5). Information accumulates additively across already presented items; the estimator error is SE(θ̂) = 1 / √I_total(θ̂).

\text{95\,\%\,CI} = \hat{\theta} \pm 1.96 \cdot \mathrm{SE}(\hat{\theta})

Confidence interval (6). Practitioners always see θ̂ with its interval, never as a point estimate.
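Equations (5) and (6) chain together into a short calculation. The sketch below assumes a plain list of item difficulties; the function names and the four-item example are hypothetical.

```python
import math


def rasch_probability(theta, beta):
    """Equation (1): probability of a correct response."""
    return 1.0 / (1.0 + math.exp(beta - theta))


def total_information(theta, betas):
    """Equation (5): sum of P * (1 - P) over the items presented so far."""
    total = 0.0
    for beta in betas:
        p = rasch_probability(theta, beta)
        total += p * (1.0 - p)
    return total


def confidence_interval(theta_hat, betas, z=1.96):
    """Equation (6): theta_hat plus/minus z * SE, with SE = 1 / sqrt(I_total)."""
    se = 1.0 / math.sqrt(total_information(theta_hat, betas))
    return theta_hat - z * se, theta_hat + z * se


# Hypothetical case: four items whose difficulty equals the current estimate.
# Each contributes 0.25, so I_total = 1 and SE = 1.
low, high = confidence_interval(0.0, [0.0, 0.0, 0.0, 0.0])
```

With so few items the interval spans almost four logits, which is why the interval is always shown next to θ̂.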

Algorithm 3 — posterior update after each response

function update_posterior(posterior, item, x_i):
    # posterior: discrete distribution over grid points θ_1..θ_K
    # x_i ∈ {0, 1}: incorrect / correct on item with difficulty β_i
    for k in 1..K:
        p_ik         ←  1 / (1 + exp(β_i − θ_k))
        likelihood_k ←  (p_ik)^x_i  ·  (1 − p_ik)^(1 − x_i)
        posterior_k  ←  posterior_k · likelihood_k

    Z          ←  Σ_k posterior_k
    posterior  ←  posterior / Z          # normalise

    θ̂          ←  Σ_k θ_k · posterior_k    # EAP
    var        ←  Σ_k (θ_k − θ̂)^2 · posterior_k
    SE         ←  √var
    return (posterior, θ̂, SE)
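Algorithm 3 translates almost line by line into Python. This is a sketch under stated assumptions: the grid from −4 to 4 logits in steps of 0.1 and the standard-normal-shaped prior are placeholders chosen for illustration, not our calibrated age-group priors.

```python
import math


def update_posterior(grid, posterior, beta_i, x_i):
    """One Bayes update on a discrete theta grid; returns (posterior, theta_hat, se)."""
    unnormalised = []
    for theta_k, weight_k in zip(grid, posterior):
        p = 1.0 / (1.0 + math.exp(beta_i - theta_k))
        likelihood = p if x_i == 1 else 1.0 - p
        unnormalised.append(weight_k * likelihood)

    z = sum(unnormalised)
    new_posterior = [w / z for w in unnormalised]

    # EAP estimate and posterior standard deviation.
    theta_hat = sum(t * w for t, w in zip(grid, new_posterior))
    variance = sum((t - theta_hat) ** 2 * w for t, w in zip(grid, new_posterior))
    return new_posterior, theta_hat, math.sqrt(variance)


# Placeholder grid and prior (assumptions, see lead-in).
grid = [-4.0 + 0.1 * k for k in range(81)]
raw = [math.exp(-0.5 * t * t) for t in grid]
prior = [w / sum(raw) for w in raw]

# A correct response on an item of difficulty 0 shifts the estimate upward.
posterior, theta_hat, se = update_posterior(grid, prior, beta_i=0.0, x_i=1)
```

The returned posterior is passed back in as the prior for the next response, so the estimate is updated incrementally without storing the full response history.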

Construct-specific anchoring of difficulty

The β values are not freely chosen but tied to the developmental literature on the respective construct. For theory of mind, we follow the Wellman-Liu scale (2004), which orders five task types in an empirically validated Guttman sequence: diverse desires, diverse beliefs, knowledge access, false belief, hidden emotion. That sequence provides an a-priori hierarchy of β parameters that we then calibrate against practice data.

For executive functions, we follow the three-component structure proposed by Miyake and colleagues (2000): inhibition, updating of working memory, and set shifting. The components correlate at moderate strength in the original work (r ≈ 0.42 to 0.63) but are factorially separable and do not collapse into a single ability dimension θ. We therefore maintain a separate estimate θ̂ for each component.

The signals we use

Three signals enter the estimate: solution rate, response time, and error type. Solution rate is the primary signal and updates θ̂ directly. Response time is treated as an ancillary variable in the sense of the extended Rasch models (Linacre 2006). It can carry information about the degree of mastery but does not enter difficulty control as a penalty. Slower is not worse, and we explicitly do not penalise it.

Error type is a qualitative covariate. On a false-belief task, it matters whether the child picked their own belief instead of the other person’s belief (the classical ToM error) or chose an unrelated option. The second kind of error suggests attentional or comprehension problems beyond the construct itself and triggers a different adaptation decision than the first.

One signal we deliberately do not use is voluntary disengagement from a session. A child who leaves a session early is not classified as having failed. The interpretation of that signal belongs to practitioners, not to the system.

Item selection: Fisher information and frustration protection

In computerised adaptive testing (CAT), the standard criterion for item selection is the maximisation of Fisher information at the current θ̂. For the Rasch model, information simplifies to a compact form that peaks at P = 0.5, i.e. at θ̂ = β. That selection is information-theoretically optimal for diagnostic purposes.

I(\theta) = P(\theta) \cdot \bigl(1 - P(\theta)\bigr)

Fisher information (2). Peaks at 0.25 when P = 0.5; diagnostic information drops sharply as P moves away from that point.

Fisher information of the Rasch model across θ − β. Optimal for diagnostics; for practice we deliberately pick items to the right of the peak.
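Equation (2) makes it easy to quantify how much information is given up by moving away from the peak. A small sketch; the function name is an assumption, and the success probabilities are the ones discussed in this article.

```python
def fisher_information(p: float) -> float:
    """Equation (2): Rasch item information at success probability p."""
    return p * (1.0 - p)


peak = fisher_information(0.5)          # maximum of the curve
at_target = fisher_information(0.75)    # success probability used for practice
retained = at_target / peak             # share of the peak information kept
```

At P = 0.75 an item still carries three quarters of the peak information, so departing from the maximum-information criterion costs precision but does not discard it.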

For a practice setting, this criterion is wrong. A constant 50-percent success rate is motivationally unfavourable because the affective load at that rate is too high, particularly for children with elevated frustration sensitivity. We shift the selection criterion to a target probability P* between 0.70 and 0.80 and pick items whose difficulty β lies below the current ability estimate.

\beta^{*} = \hat{\theta} - \ln\!\left(\dfrac{P^{*}}{1 - P^{*}}\right)

Target difficulty (3). For P* = 0.75, β* ≈ θ̂ − 1.099; the chosen item sits about 1.1 logits below current ability.
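Equation (3) follows from equation (1) by fixing the success probability at P* and solving for the difficulty. The intermediate steps, written out:

```latex
\begin{aligned}
P^{*} &= \frac{1}{1 + \exp(\beta^{*} - \hat{\theta})} \\
\exp(\beta^{*} - \hat{\theta}) &= \frac{1 - P^{*}}{P^{*}} \\
\beta^{*} &= \hat{\theta} - \ln\!\left(\frac{P^{*}}{1 - P^{*}}\right)
\end{aligned}
```

For the corridor P* = 0.70 to 0.80 the offset ln(P* / (1 − P*)) runs from about 0.85 to about 1.39 logits.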

A second correction concerns spread. Strictly deterministic selection produces recurring items, which can confound learning effects with rote memorisation. We sample the next item from a small candidate set within a difficulty window around the target value, weighted by time since the last presentation of the same item.

Algorithm 1 — item selection per practice step

function select_next_item(θ̂, P*, items, ε):
    β*       ←  θ̂  −  ln(P* / (1 − P*))
    window   ←  { i ∈ items  :  |β_i − β*|  <  ε }
    weights  ←  time_since_last_presentation(window)
    return weighted_sample(window, weights)
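Algorithm 1 can be sketched in Python as follows. The item schema (`id`, `beta`, `last_shown`), the `now` timestamp, and the behaviour for an empty window are assumptions made for this illustration; the listing above leaves them open.

```python
import math
import random


def select_next_item(theta_hat, p_target, items, epsilon, now, rng=random):
    """Sample an item near the target difficulty, favouring items not seen recently."""
    beta_target = theta_hat - math.log(p_target / (1.0 - p_target))
    window = [it for it in items if abs(it["beta"] - beta_target) < epsilon]
    if not window:
        # Assumption: the caller widens epsilon or falls back to another rule.
        return None
    # Weight by time since last presentation; the floor avoids all-zero weights.
    weights = [max(now - it["last_shown"], 1e-9) for it in window]
    return rng.choices(window, weights=weights, k=1)[0]


# Hypothetical item bank; times are in arbitrary units.
items = [
    {"id": "a", "beta": -1.0, "last_shown": 0.0},
    {"id": "b", "beta": -1.2, "last_shown": 5.0},
    {"id": "c", "beta": 0.5, "last_shown": 0.0},
]
picked = select_next_item(0.0, 0.75, items, epsilon=0.3, now=10.0,
                          rng=random.Random(0))
```

With θ̂ = 0 and P* = 0.75 the target difficulty is about −1.1, so only items "a" and "b" fall inside the window, and "a" is twice as likely to be drawn because it was shown longer ago.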

Repetition scheduling: the spacing effect

Repetition scheduling follows the spacing effect, one of the most robust findings in memory research since Ebbinghaus (1885). Cepeda and colleagues (2006) showed in a meta-analysis that retention probability depends non-monotonically on the inter-repetition interval: intervals that are too short waste time, intervals that are too long lead to forgetting. The optimal interval depends on the targeted retention horizon.

R(t) = \exp(-t / \tau)

Forgetting curve (7). Ebbinghaus’ exponential retention curve: τ grows with each successful repetition, flattening the decay.

\mathrm{gap}^{*} \,/\, \mathrm{RI} \approx 0.10 \ldots 0.20

Cepeda ratio (8). Empirical optimal ratio of spacing gap* to targeted retention interval RI; we calibrate spacing inside this corridor.

Three forgetting curves with growing time constant τ (1.5, 6, 24 days). Each successful repetition flattens the curve; this is the operative basis of spacing control.
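Equations (7) and (8) are simple enough to evaluate by hand, but a short sketch makes the numbers concrete. The function names, the six-day evaluation point, and the 30-day retention interval are illustrative assumptions.

```python
import math


def retention(t_days, tau_days):
    """Equation (7): retention after t days for time constant tau."""
    return math.exp(-t_days / tau_days)


def spacing_corridor(retention_interval_days, low=0.10, high=0.20):
    """Equation (8): range of spacing gaps for a targeted retention interval."""
    return low * retention_interval_days, high * retention_interval_days


# Retention after six days for the three time constants shown in the figure.
retention_after_six_days = {tau: retention(6.0, tau) for tau in (1.5, 6.0, 24.0)}

# Spacing corridor for a hypothetical 30-day retention target.
gap_low, gap_high = spacing_corridor(30.0)
```

After six days, retention is roughly 0.02 for τ = 1.5, 0.37 for τ = 6, and 0.78 for τ = 24; the corridor for a 30-day target is three to six days.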

Operationally we use a simplified variant of Wozniak’s SM-2 algorithm. Items carry an expanding repetition interval with an ease factor that is adjusted after each evaluation. Unlike classical vocabulary applications, the grading scale is not a subjective self-assessment but is derived from solution rate and error type. This is not original research; it is the application of an established algorithm in a new context, and we label it as such.

Algorithm 2 — spacing update after SM-2

function update_schedule(item, q):
    # q ∈ {0..5} derived from solution rate and error type
    EF  ←  EF  +  0.1  −  (5 − q) · (0.08  +  (5 − q) · 0.02)
    EF  ←  max(EF, 1.3)

    if q < 3:
        n              ←  0
        next_interval  ←  1
    else if n = 0:
        n              ←  1
        next_interval  ←  1
    else if n = 1:
        n              ←  2
        next_interval  ←  6
    else:
        n              ←  n + 1
        next_interval  ←  previous_interval · EF

    return (n, next_interval, EF)
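A runnable sketch of Algorithm 2 follows. The `ScheduleState` container and its field names are assumptions for this example; the starting ease factor of 2.5 is the conventional SM-2 default and may differ from the value in our calibration tables.

```python
from dataclasses import dataclass


@dataclass
class ScheduleState:
    n: int = 0                  # successful repetitions in a row
    interval_days: float = 0.0  # last scheduled interval
    ease: float = 2.5           # ease factor EF (SM-2 default, an assumption here)


def update_schedule(state: ScheduleState, q: int) -> ScheduleState:
    """Apply one SM-2 style update for a grade q in 0..5."""
    ease = state.ease + 0.1 - (5 - q) * (0.08 + (5 - q) * 0.02)
    ease = max(ease, 1.3)

    if q < 3:
        return ScheduleState(0, 1.0, ease)   # lapse: restart the sequence
    if state.n == 0:
        return ScheduleState(1, 1.0, ease)
    if state.n == 1:
        return ScheduleState(2, 6.0, ease)
    return ScheduleState(state.n + 1, state.interval_days * ease, ease)


# Three good responses (q = 4) followed by a lapse (q = 2).
state = ScheduleState()
history = []
for grade in (4, 4, 4, 2):
    state = update_schedule(state, grade)
    history.append((state.n, state.interval_days))
```

A grade of 4 leaves the ease factor unchanged, so the intervals run 1, 6, and 15 days before the lapse resets the item to a one-day interval.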

Why no learned adaptation layer

Several families of learned models would be methodologically conceivable: Bayesian Knowledge Tracing (Corbett and Anderson 1995), Deep Knowledge Tracing (Piech and colleagues 2015), contextual multi-armed bandits (LinUCB, Thompson sampling), or full reinforcement learning. We use none of them. The reasons are not principled but sample-bound and epistemic.

BKT requires stable estimates of its four parameters per skill: prior knowledge P(L_0), learning rate P(T), slip P(S), and guess P(G). The posterior update after each response follows a Bayes rule whose value collapses without stable parameters. With small samples, these estimates are so uncertain that the model effectively stays at its priors; the added value over explicit rules vanishes.

P(L_t \mid \text{correct}) = \dfrac{P(L_t) \cdot (1 - S)}{P(L_t) \cdot (1 - S) + \bigl(1 - P(L_t)\bigr) \cdot G}

BKT update (9). Posterior probability of the "mastered" state after a correct response, with S = P(S) and G = P(G); the update for incorrect responses is analogous. After the observation, the learning step with T = P(T) gives P(L_{t+1}) = P(L_t | obs) + (1 − P(L_t | obs)) · T.
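For comparison, the BKT step in equation (9) can be written out as below. This is only an illustration of the model we decided against; the incorrect-response branch uses the standard BKT form, and the parameter values in the example are made up.

```python
def bkt_update(p_known, correct, slip, guess, learn):
    """One BKT step: Bayes update on the observation, then the learning transition."""
    if correct:
        numerator = p_known * (1.0 - slip)
        denominator = numerator + (1.0 - p_known) * guess
    else:
        numerator = p_known * slip
        denominator = numerator + (1.0 - p_known) * (1.0 - guess)
    posterior = numerator / denominator
    return posterior + (1.0 - posterior) * learn


# Made-up parameters: P(L_t) = 0.5, S = 0.1, G = 0.2, T = 0.1.
after_correct = bkt_update(0.5, True, slip=0.1, guess=0.2, learn=0.1)
after_incorrect = bkt_update(0.5, False, slip=0.1, guess=0.2, learn=0.1)
```

The result depends on all four parameters at once, which is why unstable estimates of S, G, and T make the update uninformative.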

Deep Knowledge Tracing reports higher predictive accuracy than BKT in the literature but is an RNN-based black box. The question "why did the module show this exercise?" cannot be answered. For a practitioner-mediated setting in which the module is used together with a professional, that is a design defect.

Multi-armed bandits and reinforcement learning assume an explore-exploit logic in which the system deliberately picks suboptimal actions in order to learn. UCB1 (Auer et al. 2002) achieves logarithmic regret, but accumulates that regret through real wrong decisions on the child. That framing is ethically and pragmatically wrong for a clinically embedded practice module. We do not want exploration on individual children to be part of the system architecture.

R_T \,\le\, 8 \cdot \sum_{i:\, \Delta_i > 0} \dfrac{\ln T}{\Delta_i} \,+\, \left(1 + \dfrac{\pi^{2}}{3}\right) \cdot \sum_i \Delta_i

UCB1 regret bound (10). Cumulative regret after T draws for K arms with gaps Δ_i = μ* − μ_i. For fixed gaps the bound grows as O(ln T); the gap-independent worst case is O(√(K · T · ln T)). Either way, the regret is paid in real wrong decisions on the child.
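The bound in equation (10) can be evaluated numerically to see how it scales. A sketch with hypothetical gaps and horizons; nothing here is part of our system.

```python
import math


def ucb1_regret_bound(gaps, horizon):
    """Right-hand side of equation (10) for the given gaps and number of draws."""
    positive_gaps = [g for g in gaps if g > 0]
    log_term = 8.0 * sum(math.log(horizon) / g for g in positive_gaps)
    constant_term = (1.0 + math.pi ** 2 / 3.0) * sum(gaps)
    return log_term + constant_term


# Hypothetical three-armed problem; the best arm has gap 0.
gaps = [0.0, 0.1, 0.2]
bound_100 = ucb1_regret_bound(gaps, 100)
bound_1000 = ucb1_regret_bound(gaps, 1000)
```

Small gaps inflate the bound because they enter the logarithmic term as 1 / Δ_i, so arms that are only slightly worse are the ones a bandit keeps trying longest.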

Finally, every one of these models would substantially complicate the planned methodological publication. Reproducing the behaviour of a rule set is trivial; reproducing the behaviour of a trained neural network is not.

What practitioners see, and what remains open

Adaptivity is only useful in a practitioner-mediated model if it is visible. We are designing a session summary that exposes the following quantities: the current ability estimate θ̂ with confidence interval, the recently presented items with their β, the observed error types, and the adaptation triggered by the rules. With that, the practitioner can take the observations into the session without having to reverse-engineer the behaviour of the system.
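The quantities listed above map onto a small record type. The sketch below is one possible shape, not the final design; every class and field name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PresentedItem:
    item_id: str
    beta: float                       # item difficulty
    correct: bool
    error_type: Optional[str] = None  # e.g. "own_belief" or "unrelated"


@dataclass
class SessionSummary:
    theta_hat: float                  # current ability estimate
    se: float                         # standard error of the estimate
    items: List[PresentedItem] = field(default_factory=list)
    adaptations: List[str] = field(default_factory=list)  # rules that fired

    def confidence_interval(self, z: float = 1.96) -> Tuple[float, float]:
        """Interval shown to the practitioner alongside theta_hat."""
        return self.theta_hat - z * self.se, self.theta_hat + z * self.se


# Hypothetical summary after one item.
summary = SessionSummary(theta_hat=0.4, se=0.5)
summary.items.append(PresentedItem("fb-01", beta=-0.7, correct=False,
                                   error_type="own_belief"))
summary.adaptations.append("repeat construct at same complexity level")
interval = summary.confidence_interval()
```

Keeping the fired rules as readable strings next to the items is what lets a practitioner trace why an exercise appeared.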

Several questions are not settled. How is the target probability P* calibrated by construct and age? Which priors for the ability estimate in the initial phase are appropriate without biasing toward expected deficits? How do spacing intervals shift under therapy breaks or periods of heavy school workload? Which operationalisation of error type carries weight for the emotion-regulation modules, where the response dimension is not binary?

The methodological publication on our roadmap covers this work. Until then, the rule sets, parameterisations, and calibration tables we use are documented openly so that practitioners and researchers can question them and we can revise them when practice suggests we should.