Decades of research have shown that access to technology alone does not improve education outcomes. And with generative AI, there is a real risk of making learning too frictionless: giving students answers before they've done the hard thinking that actually builds skills.
An earlier study out of Turkey raised exactly this concern. In a randomized trial of 1,000 high school students, those using GPT-4 performed much better on practice problems but scored 17 percent worse once the AI was removed. The likely culprit was overreliance: students asked for answers, copied them, and moved on. But the researchers also showed that a more structured prompt that offered step-by-step hints rather than direct answers largely eliminated the negative effects. In some ways, that was the most important finding: the design of the AI tutor mattered enormously.
Two new studies reinforce this. The first set out to test what happens when an intelligent system actively guides what students practice, when, and at what difficulty level. Across ten schools, more than 700 high school students received the same lectures, course materials, and GenAI chatbot tutor over five months. The only difference was that half received a fixed sequence of practice problems (easy to hard, the standard approach), while the other half had their problem sequence dynamically personalized by a reinforcement learning (RL) algorithm.
The RL system used LLMs (GPT-4o, Claude) to extract signals from student behavior: analyzing code edits, distinguishing meaningful solution attempts from superficial resubmissions, and evaluating the quality of chatbot conversations. These signals fed a continuous estimate of each student's evolving knowledge state, letting the system select problems at the appropriate difficulty level in real time.
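The study's actual code isn't reproduced here, but the core loop can be sketched. In the minimal Python sketch below, a simple weighted score and moving-average update stand in for the LLM-derived signals and the RL policy; the class name, weights, and difficulty offset are all illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of adaptive problem sequencing. The real system
# used GPT-4o/Claude to score student behavior and an RL policy to
# choose problems; simple heuristics stand in for both here.

class AdaptiveSequencer:
    def __init__(self, problems):
        # problems: list of (problem_id, difficulty), difficulty in [0, 1]
        self.problems = problems
        self.knowledge = 0.1   # crude scalar estimate of the student's level

    def update(self, attempt_quality, chat_quality):
        # attempt_quality / chat_quality stand in for the LLM-derived
        # signals (meaningful code edits, substantive chatbot dialogue);
        # both are assumed to be scores in [0, 1].
        signal = 0.7 * attempt_quality + 0.3 * chat_quality
        # Exponential moving average as a stand-in for the RL value update.
        self.knowledge = 0.8 * self.knowledge + 0.2 * signal

    def next_problem(self):
        # Target a difficulty slightly above the current estimate, so the
        # student is stretched but not overwhelmed (productive struggle).
        target = min(1.0, self.knowledge + 0.15)
        return min(self.problems, key=lambda p: abs(p[1] - target))

seq = AdaptiveSequencer([("p1", 0.2), ("p2", 0.5), ("p3", 0.8)])
seq.update(attempt_quality=0.6, chat_quality=0.4)
print(seq.next_problem())   # picks the problem nearest the target difficulty
```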

Students with adaptive sequencing scored 0.15 standard deviations higher on an in-person, handwritten final exam taken with no devices or AI assistance, a gain that by some estimates is equivalent to six to nine months of additional schooling, with no extra instruction time or additional teacher workload. Beginners with no prior experience saw the largest gains, and students at lower-tier schools benefited more than those at elite schools. Most interestingly, the gains weren't driven by students completing more problems or being pushed toward harder material. They were driven almost entirely by increased engagement: more time on task, more attempts, and higher-quality interactions with the chatbot. The system didn't make learning easier. It made it easier to do the hard work of learning.
A second study, an NBER working paper, points in a similar direction from a different angle. The researchers designed and tested a collaborative filtering recommendation system for an English-learning app serving children in India. Rather than adapting difficulty, this system personalized which stories children were shown, matching kids with content they’d actually want to read.
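Collaborative filtering of this kind is a standard technique. A minimal item-based version over a binary completion matrix might look like the sketch below; the data and function are illustrative, not the paper's actual model:

```python
import numpy as np

# Illustrative item-based collaborative filtering, assuming we only
# observe which stories each child has finished (1 = completed).
# rows = children, columns = stories
interactions = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
])

def recommend(child, k=2):
    # Cosine similarity between stories, based on co-completion patterns.
    norms = np.linalg.norm(interactions, axis=0) + 1e-9
    sim = (interactions.T @ interactions) / np.outer(norms, norms)
    # Score unseen stories by their similarity to stories the child read.
    scores = sim @ interactions[child]
    scores[interactions[child] == 1] = -np.inf   # exclude already-read
    return np.argsort(scores)[::-1][:k]

print(recommend(child=0))   # story indices to surface next
```

The design choice mirrors the paper's goal: rather than calibrating difficulty, it predicts which content a particular child is likely to engage with, based on what similar children actually read.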
In an RCT with 7,750 users, personalization increased engagement by 60 percent in the targeted section and 14 percent app-wide. Users also spent more time reading and completed more difficult stories. Over five months of scaled deployment, the estimated effects were about 2.5 times larger than the experimental benchmark.
These studies push back on the fear that AI in education inevitably becomes a crutch. But they also point to what comes next.
First, design is critically important, and that starts with how prompts and systems are structured: not just what they're trying to maximize, but equally what behaviors they're trying to minimize. The Turkey study showed that an unconstrained chatbot can undermine learning; add the right guardrails through the prompt alone, and the harm largely disappears. Constraints matter as much as capabilities.
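To make that concrete, here is a rough sketch of what such a prompt-level guardrail could look like. The wording and the variable name are assumptions in the spirit of the Turkey study's tutoring condition, not its actual prompt:

```python
# Hypothetical guardrail prompt: constrain the model to hints and
# guiding questions rather than answers. This wording is illustrative.

TUTOR_SYSTEM_PROMPT = """\
You are a math tutor. Never state the final answer.
- Respond with one hint or guiding question at a time.
- If the student asks for the answer, restate the next step instead.
- After a correct student step, confirm it and prompt the next step.
"""

# With a typical chat-completions-style client, the guardrail is simply
# the first message in every conversation:
messages = [
    {"role": "system", "content": TUTOR_SYSTEM_PROMPT},
    {"role": "user", "content": "What is the answer to problem 3?"},
]
```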
Second, the models themselves are getting dramatically better. In just two years, we’ve gone from “AI gets math wrong” to “AI beats PhD experts on one of the hardest graduate-level reasoning benchmarks.” These are exponential improvements, not incremental gains. The tutoring systems in these studies were built on models that are generations behind what’s available today through ChatGPT 5.4, Gemini 3.1 Pro, and Claude Opus 4.6.

Third, the agentic wave sweeping Silicon Valley has yet to reach education in any meaningful way, and its arrival could produce far better tutors. Imagine a coordinated team of AI agents: one diagnosing exactly where a student’s understanding breaks down, another selecting problems tied to the curriculum, another deciding when to offer a hint versus when to let the student struggle, another evaluating the quality of the student’s reasoning rather than just checking answers, and a critic auditing the others to catch errors and prevent the system from becoming a crutch. The single AI chatbot tutor may prove far less effective than a team of specialists working in concert with both student and teacher.
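As a rough sketch of how such an orchestration might be wired, each specialist below is stubbed as a plain function standing in for an LLM call with its own role prompt. Every name and return value is hypothetical:

```python
# Hypothetical multi-agent tutor loop; each stub would be an LLM call
# with a dedicated role prompt in a real system.

def diagnose(student_work):        # where does understanding break down?
    return "confuses slope with intercept"

def select_problem(diagnosis):     # curriculum-aligned practice
    return "graph y = 2x + 1 and identify the slope"

def hint_policy(attempts):         # hint now, or let the student struggle?
    return "hint" if attempts >= 2 else "wait"

def critic(trace):                 # audit the others to prevent over-helping
    return "ok" if trace.count("hint") <= 1 else "too many hints"

# One turn of the orchestration loop:
diagnosis = diagnose("student's last three attempts")
problem = select_problem(diagnosis)
action = hint_policy(attempts=2)
verdict = critic([action])
print(diagnosis, "|", problem, "|", action, "|", verdict)
```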
AI works in education when it’s used not to give answers but to orchestrate the conditions for productive struggle. The result is deeper engagement and stronger outcomes.