Running Karpathy's Autoresearch with Local LLM for Zero Cost Autonomous AI Research

Author: Nino, Senior Tech Editor
Andrej Karpathy, co-founder of OpenAI and former Director of AI at Tesla, recently unveiled a provocative experiment called autoresearch. The premise is deceptively simple but profoundly powerful: an LLM agent is tasked with autonomously modifying a GPT training script, running short experiments, evaluating the results, and iteratively improving the model's performance. While the original implementation relied on Claude Code (a cloud-based API), the developer community has already pushed the boundaries further.

In this tutorial, we will explore how to run a local fork of Karpathy's autoresearch using Qwen 3.5 9B via Ollama. This approach allows for zero API costs and full data sovereignty. For developers who need even higher reasoning capabilities without the overhead of local hardware, platforms like n1n.ai provide the necessary high-speed access to models like Claude 3.5 Sonnet and DeepSeek-V3 to power similar autonomous loops.

The Core Concept: Autonomous ML Engineering

Karpathy's autoresearch operates on a continuous feedback loop. The "Researcher" (the LLM) is given access to a training script (train.py) and a history of previous experiments (results.tsv). Its goal is to minimize val_bpb (bits per byte), a standard metric for character-level language models.
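
The val_bpb metric follows directly from the cross-entropy loss: for a byte-level model, a mean loss of L nats per byte equals L / ln 2 bits per byte. A minimal conversion helper (the function name is our own, not from the repository):

```python
import math

def bits_per_byte(mean_loss_nats: float) -> float:
    """Convert mean cross-entropy loss in nats per byte to bits per byte."""
    return mean_loss_nats / math.log(2)

# a loss of exactly ln(2) nats per byte is exactly 1 bit per byte
print(bits_per_byte(math.log(2)))  # → 1.0
```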

The original version used Claude 3.5 Sonnet, which is widely regarded as one of the best models for coding tasks. However, the cost of running hundreds of iterations can add up. By switching to a local model like Qwen 3.5 9B, we can run the loop indefinitely for the cost of electricity alone.

Hardware and Architecture Strategy

The primary challenge in running both the LLM and the training script locally is VRAM management. A typical setup requires a single high-end GPU (like an NVIDIA RTX 3090 or 4090 with 24GB VRAM, or ideally an A6000 with 48GB).

Here is how the VRAM is typically partitioned in the local fork:

  • Researcher (Qwen 3.5 9B via Ollama): Occupies approximately 12GB VRAM.
  • Subject (GPT Training via train.py): Occupies approximately 35GB VRAM (on a 48GB card) or scaled down for 24GB cards.

To accommodate both workloads on one card, the local fork scales down the training hyperparameters relative to Karpathy's original baseline:

Component        Karpathy Original    Local LLM Fork
Model Depth      8 layers             4 layers
Batch Size       128                  64
Total Tokens     524K                 65K
Window Pattern   SSSLL

While the model being trained is smaller, the research loop remains effective: the agent can execute far more iterations in the same timeframe, with no API rate limits to slow it down.
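
Assuming train.py exposes these settings as ordinary hyperparameters (the key names below are illustrative, not taken from the repository), the scaled-down run corresponds to overrides like:

```python
# illustrative hyperparameter overrides for the local fork
LOCAL_OVERRIDES = {
    "n_layer": 4,            # down from 8 in Karpathy's baseline
    "batch_size": 64,        # down from 128
    "total_tokens": 65_000,  # "65K" in the table; exact value depends on the fork
}
```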

Setting Up the Local Environment

To get started, you need Ollama installed and the Qwen 3.5 9B model pulled. If you find your local hardware struggling with the 9B model's reasoning, you can always bridge the gap by using n1n.ai to access larger models via a unified API, ensuring your autonomous agent doesn't get stuck on complex logic.

  1. Install Ollama and pull the model:

     curl -fsSL https://ollama.com/install.sh | sh
     ollama serve &
     ollama pull qwen3.5:9b

  2. Clone the repository and sync dependencies:

     git clone https://github.com/SohniSwatantra/autoresearch-local-llm.git
     cd autoresearch-local-llm
     pip install uv
     uv sync

  3. Run the pipeline:

     bash run_pipeline.sh
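
Before kicking off the pipeline, it is worth confirming that Ollama is actually answering requests. Ollama serves a simple HTTP API on port 11434; here is a minimal non-streaming call using only the standard library (the helper names are our own):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    # non-streaming request body for Ollama's /api/generate endpoint
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask(model: str, prompt: str, timeout: int = 120) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# e.g. ask("qwen3.5:9b", "Say hello in one word.")
```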

Deep Dive into agent.py

The heart of the system is agent.py. This script handles the communication between the training logs and the LLM. It follows a strict protocol:

  1. Context Loading: It reads the current train.py and the results.tsv file.
  2. Prompting: It asks the LLM to propose a change to train.py that might lower the loss.
  3. Validation: It uses ast.parse() to ensure the LLM hasn't generated syntactically incorrect Python code.
  4. Execution: It overwrites train.py, commits the change to Git, and runs the 5-minute training job.
  5. Evaluation: If val_bpb improves, the change is kept. If not, it performs a git reset --hard to return to the last known good state.
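
The keep-or-revert logic in steps 4 and 5 can be sketched as follows. run_experiment is our own simplified stand-in for the fork's orchestration, not code from agent.py, and the exact git invocations are assumptions:

```python
import subprocess

def decide(baseline_bpb: float, new_bpb: float) -> str:
    """Keep the change only if it strictly improves val_bpb."""
    return "keep" if new_bpb < baseline_bpb else "revert"

def run_experiment(baseline_bpb: float, read_val_bpb) -> float:
    # commit the agent's proposal, run the short training job,
    # then keep the change or roll back to the last known good state
    subprocess.run(["git", "commit", "-am", "agent proposal"], check=True)
    subprocess.run(["python", "train.py"], check=True, timeout=300)
    new_bpb = read_val_bpb()  # e.g. parse the last row of results.tsv
    if decide(baseline_bpb, new_bpb) == "keep":
        return new_bpb
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)
    return baseline_bpb
```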

One of the most elegant parts of the agent.py implementation is the code extraction logic:

import re
import ast

def extract_code_from_response(response):
    # Using regex to find the python block
    blocks = re.findall(r"```(?:python)?\s*\n(.*?)\n```", response, re.DOTALL)
    if not blocks:
        return None

    # Select the longest block as it's usually the full script
    candidate = max(blocks, key=len)

    try:
        ast.parse(candidate)
        return candidate
    except SyntaxError:
        return None

Why Qwen 3.5 9B?

Choosing the right model for the "Researcher" role is critical. While Claude 3.5 Sonnet is the gold standard, Qwen 3.5 9B has shown remarkable proficiency in following system instructions and generating valid Python code. In our tests, it successfully suggested architectural changes like moving from standard self-attention to Grouped Query Attention (GQA) and adjusting learning rate schedulers.
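
Grouped Query Attention cuts KV-cache memory by letting several query heads share a single key/value head. A minimal single-sequence sketch in NumPy (shapes only, no causal mask; this is our own illustration, not the fork's implementation):

```python
import numpy as np

def gqa(q, k, v):
    """q: (T, n_q_heads, hd); k, v: (T, n_kv_heads, hd), n_q_heads % n_kv_heads == 0."""
    T, n_q, hd = q.shape
    group = n_q // k.shape[1]
    # each group of query heads attends to the same shared KV head
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    for h in range(n_q):
        att = q[:, h] @ k[:, h].T / np.sqrt(hd)
        att = np.exp(att - att.max(axis=-1, keepdims=True))
        att /= att.sum(axis=-1, keepdims=True)   # row-wise softmax
        out[:, h] = att @ v[:, h]
    return out

# 8 query heads sharing 2 KV heads: the KV tensors are 4x smaller
T, hd = 16, 32
out = gqa(np.random.randn(T, 8, hd),
          np.random.randn(T, 2, hd),
          np.random.randn(T, 2, hd))
print(out.shape)  # → (16, 8, 32)
```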

If you require even more sophisticated research capabilities—such as the reasoning-heavy "Think" mode found in DeepSeek-R1—you can integrate those models via n1n.ai to see if a more powerful model yields better training breakthroughs than the 9B local variant.

Design Philosophies for Autonomous Research

Karpathy outlined several key principles in his program.md that are preserved in this local fork:

  1. Never Stop: The loop is designed to run indefinitely. You can start the process before going to sleep and wake up to 100+ completed experiments.
  2. Simplicity Over Complexity: The agent is encouraged to discard "hacky" code. If a 20-line change only improves the result by 0.001, it is often better to stick to the simpler baseline.
  3. Assume the User is Sleeping: The system must be robust enough to handle crashes. The fork includes a failsafe that resets the code to the baseline after 3 consecutive crashes.
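
The three-crash failsafe in principle 3 is easy to model as a small state machine (our own sketch, not the fork's code):

```python
def update_crash_state(crash_count: int, crashed: bool, limit: int = 3):
    """Return (new_crash_count, should_reset_to_baseline)."""
    if not crashed:
        return 0, False          # a successful run clears the counter
    crash_count += 1
    if crash_count >= limit:
        return 0, True           # three in a row: reset code to the baseline
    return crash_count, False

count = 0
for _ in range(3):               # simulate three consecutive crashes
    count, reset = update_crash_state(count, crashed=True)
print(reset)  # → True
```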

Cost Comparison: Cloud vs. Local

Setup                        Cost per Experiment    100 Experiments
Original (Claude API)        ~$0.15                 $15.00
Local Fork (Dedicated GPU)   $0.00                  $0.00
Local Fork (Rented GPU)      ~$0.08                 $8.00

For researchers running thousands of iterations, the savings are substantial. However, the trade-off is the initial hardware investment.

Conclusion

Running Karpathy's autoresearch locally is a glimpse into the future of software engineering and machine learning. We are moving from a world where humans write code to a world where humans define the constraints and metrics, and AI agents explore the solution space.

Whether you choose to run entirely locally or use a high-performance aggregator like n1n.ai to power your agents, the era of autonomous research is here. By automating the "boring" parts of ML experimentation—hyperparameter tuning and minor architectural tweaks—we free ourselves to focus on high-level conceptual breakthroughs.

Get a free API key at n1n.ai