
Rooted in research, built with intention

Every design decision in Aristotle traces back to peer-reviewed learning science. Here's the research behind how we teach.

One-on-one tutoring is the most effective form of education ever measured

In 1984, educational psychologist Benjamin Bloom described what he called the “2 Sigma Problem.” Students who received one-on-one tutoring learned far more than those in typical classroom instruction, and no group teaching method came close. Decades of research since have pointed to the same conclusion: personalized tutoring is one of the most effective forms of instruction ever studied. The challenge has always been making it available to every student.


AI tutoring works

Intelligent tutoring systems have been studied for decades. The consistent finding: they approach the effectiveness of human tutors. Recent trials with modern AI show the gap closing further.


AI tutoring outperforms active learning (Harvard, 2025)

A randomized trial of 194 physics students at Harvard found that students using an AI tutor learned more, and in less time, than students in active learning classrooms with peer instruction and real-time feedback.

Read the paper

AI tutoring matches human tutors (Google DeepMind, 2025)

Across five UK secondary schools, students tutored by an AI system performed as well as those tutored by humans, and were more likely to solve novel problems on their own afterward.

Read the paper

Most AI gets teaching fundamentally wrong

General-purpose AI is optimized to answer questions. A good tutor is optimized to build understanding. These are fundamentally different objectives.

| | Aristotle | ChatGPT |
| --- | --- | --- |
| When a student asks for help | Guides them toward the answer through questions | Gives the answer |
| Adapts to the student | Adjusts approach based on what the student knows and where they struggle | Responds the same way regardless of level |
| When a student gets it wrong | Diagnoses the reasoning behind the mistake | Corrects the answer |
| Visual learning | Live whiteboard with equations, graphs, and diagrams drawn in real time | Text and images only |
| How students engage | Students explain their thinking out loud via voice | Students read and type |

Research on tutoring software found that when students can get the answer by asking repeatedly, most of them will. Students who take that shortcut learn only two-thirds as much as those who don't.

Read the research

The science behind every design decision

Each feature in Aristotle traces back to a specific finding in learning science. Click any principle to see the research.

Standard LLMs have a strong bias toward revealing solutions directly. Research in pedagogical reinforcement learning shows that overcoming this bias requires explicit training against answer-giving. SocraticLM, presented at NeurIPS 2024, demonstrated that an LLM fine-tuned for Socratic dialogue outperformed GPT-4 by over 12% on teaching performance metrics. Aristotle is built on this principle: every response asks a guiding question rather than providing the answer.

Research on the Bridge framework shows that expert tutors follow a structured process when students err: (A) identify the specific error, (B) choose a remediation strategy, and (C) form a pedagogical intention before responding. Novice tutors skip these steps, jumping straight to correction. When LLMs are guided by this expert decision-making structure, tutoring quality improves dramatically. Aristotle's error handling mirrors this expert process.
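The three-step expert process described above can be sketched as a small decision pipeline. This is a minimal illustration under stated assumptions: the class, function names, and toy error classifier are hypothetical, not Aristotle's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Remediation:
    error_type: str  # (A) the specific error identified
    strategy: str    # (B) the remediation strategy chosen
    intention: str   # (C) the pedagogical intention formed before responding


def plan_response(student_answer: str, correct_answer: str) -> Remediation:
    """Mirror the expert tutor process: diagnose before replying."""
    # (A) Identify the specific error (a toy classifier for illustration)
    if student_answer.strip() == correct_answer.strip():
        error_type = "none"
    elif any(ch.isdigit() for ch in student_answer):
        error_type = "computational"
    else:
        error_type = "conceptual"

    # (B) Choose a remediation strategy based on the error type
    strategy = {
        "none": "extend",          # push toward a harder variant
        "computational": "trace",  # walk back through the arithmetic
        "conceptual": "probe",     # ask a question exposing the misconception
    }[error_type]

    # (C) Form a pedagogical intention *before* generating any reply text
    intention = f"Use '{strategy}' to address a {error_type} error"
    return Remediation(error_type, strategy, intention)
```

The key design point is ordering: the novice shortcut (jumping straight to correction) corresponds to generating a reply before steps (A) through (C) have run.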

The MISTAKE framework uses cycle consistency to model the relationship between incorrect answers and their underlying misconceptions. This allows the system to anticipate why a student is confused — not just that they are — and to address the specific conceptual gap. This is fundamentally different from simply marking an answer wrong and re-explaining the procedure.

The ICAP framework, one of the most cited works in learning science, establishes a clear hierarchy of cognitive engagement: Interactive > Constructive > Active > Passive. Typing a question into ChatGPT and reading the answer is passive. Speaking through a problem with a tutor who responds to your reasoning is interactive — the highest level. Aristotle's voice-first design is a direct application of this principle: students explain their thinking out loud, producing the constructive and interactive engagement that decades of research show produces the deepest learning.


The TRAVER framework combines knowledge tracing with turn-by-turn verification to ensure each tutor response proactively guides students toward understanding, not just toward the answer. Research shows this achieves significantly higher task completion rates than unverified tutoring approaches. Additionally, work on conversational uptake demonstrates that when tutors build on student contributions — acknowledging, reformulating, extending what the student said — student achievement improves. Aristotle verifies every response for pedagogical quality before delivery.
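A verify-before-delivery loop of the kind described above can be sketched as follows. This is an illustrative sketch only: the generator and checker are stand-in stubs with hypothetical names, not Aristotle's models or the TRAVER implementation.

```python
def generate_candidate(history: list[str], attempt: int) -> str:
    """Stand-in for an LLM tutor call; varies its output across retries."""
    prompts = [
        "The answer is 42.",
        "What does the problem ask you to find first?",
    ]
    return prompts[min(attempt, len(prompts) - 1)]


def is_pedagogical(response: str) -> bool:
    """Toy verifier: reject answer-giving, require a guiding question."""
    gives_answer = "answer is" in response.lower()
    asks_question = response.strip().endswith("?")
    return asks_question and not gives_answer


def tutor_turn(history: list[str], max_retries: int = 3) -> str:
    """Regenerate candidates until one passes verification."""
    for attempt in range(max_retries):
        candidate = generate_candidate(history, attempt)
        if is_pedagogical(candidate):
            return candidate
    # Fall back to a safe Socratic move if every candidate fails
    return "Can you walk me through your reasoning so far?"
```

The point of the structure is that verification sits between generation and delivery: an answer-revealing candidate is discarded and regenerated rather than shown to the student.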

How the field measures AI tutoring quality

The research community has developed rigorous frameworks for evaluating whether AI tutors actually teach. We build against these standards.

Pedagogical evaluation taxonomy

The field has converged on 8 pedagogical dimensions for evaluating AI tutors, with MRBench providing the gold-standard benchmark of 192 conversations and 1,596 expert-annotated responses.

Read the paper

Scale AI’s TutorBench

A benchmark of 1,490 expert-crafted prompts evaluating adaptive explanations, actionable feedback, and hint generation. No frontier model scores above 56% — showing that raw language ability alone doesn't make a good tutor.

Read the paper

Google’s LearnLM framework

Google's evaluation-driven approach to educational AI identified key pedagogical principles and showed that educators and learners consistently preferred pedagogically fine-tuned models over standard LLMs.

Read the paper

The AI Teacher Test

Research testing GPT-3 and Blender as teachers found both were quantifiably worse than human teachers, especially on helpfulness. Raw language ability does not equal teaching ability — pedagogical fine-tuning is essential.

Read the paper

Research isn't a marketing page. It's how we build.

Built by learning scientists

Our team includes Stanford-trained researchers with backgrounds in learning science, cognitive psychology, and education technology. Over 10,000 hours of combined tutoring experience.

Continuously evaluated

We test Aristotle against published pedagogical benchmarks including MRBench and TutorEval, and every session is reviewed by humans for teaching quality.

Transparent about limitations

AI tutoring is a young field. We don't claim to have solved education. We claim to be building on the best available evidence and measuring our results honestly.

Give your child the advantage they deserve

Set up your family's Aristotle account in under two minutes. No credit card, no commitment — just research-backed tutoring, ready when they are.