GitHub All-Stars #7: Paper2Agent - The alchemy that transforms research papers into working code
Artur Skowroński
Head of Java/Kotlin Space
Published: Oct 15, 2025 | 17 min read
I’ve always said there’s no better way to learn anything than by building something yourself… and the second-best is reviewing someone else’s code.
A sobering realization recently hit us at VirtusLab — our “collection” of starred projects had grown to enormous proportions, without bringing real value to either us or the wider community. So, we decided to change our approach: add a bit of regularity and become chroniclers of these open-source gems. That way, we’ll understand them better and discover where we can genuinely contribute.
This week, we’re diving into a project that tackles one of the most fundamental problems in the world of science and engineering: the frustrating gap between the theory described in a research paper and its practical implementation.
Anyone who’s ever tried to reproduce results from a paper knows the pain. The code, if it’s even available, is often a tangled mess of one-off scripts and Jupyter notebooks - making reproducibility, the cornerstone of science, more of an art than a craft.
This “reproducibility crisis” sets the stage for a project that aims to change that.
PS: This is my favorite XKCD in ages.
That’s why today we’re looking at Paper2Agent — a project by Jiacheng Miao from Stanford University that’s not just another simple Paper2Code-style tool. It’s a far more ambitious system, described as a “multi-agent AI system that automatically converts scientific publications into interactive AI agents.”
The real innovation behind Paper2Agent doesn’t lie in generating code from scratch. The system doesn’t just read a PDF and magically produce a program. Instead, it takes the existing GitHub repository linked to a publication and transforms its messy tutorials into reliable, reusable, and — most importantly — interactive tools. It’s an automated engine for refactoring and wrapping research code into APIs, aiming to industrialize scientific prototypes.
The problem with research code is well known — scientists are not software engineers. As a result, the source code for such works is usually written with a single goal in mind: to generate the figures and tables needed for publication. It’s fragile, full of hard-coded paths, and lacks the modularity needed for reuse.
The goal of Paper2Agent is to fundamentally change that. Instead of forcing researchers to spend hours wrestling with dependencies, the project aims to create a robust, interactive tool that allows others to easily apply the methods from a publication to their own data.
At the heart of this philosophy lies the concept of the Model Context Protocol (MCP) — and as described in the project’s paper, it consists of three key components:
MCP Tools - Executable, parameterized Python functions that encapsulate key methods from the publication. These are refactored and cleaned-up versions of the original tutorial code.
MCP Resources - A structured repository of static assets: publication text, source code, datasets, tables, etc. This provides the context and data needed by the tools.
MCP Prompts - Natural language templates that guide an AI (or human) through complex, multi-step workflows using the MCP Tools and Resources.
Examples like AlphaGenome and Scanpy illustrate this perfectly. An MCP tool for AlphaGenome can take a genetic variant as input and return predictions of its effect, fully abstracting away the complexity of model execution.
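To make that concrete, here is a minimal sketch of what such an MCP tool could look like, assuming the FastMCP Python API. The tool name, parameters, and placeholder prediction below are illustrative, not AlphaGenome's actual interface or Paper2Agent's generated code:

```python
# Hypothetical MCP tool sketch, assuming the FastMCP framework.
# The function name, parameters, and placeholder logic are illustrative.
from fastmcp import FastMCP

mcp = FastMCP("alphagenome-demo")

@mcp.tool()
def predict_variant_effect(chromosome: str, position: int, ref: str, alt: str) -> dict:
    """Predict the regulatory effect of a genetic variant.

    Wraps the model call behind a clean, parameterized signature so the
    caller never touches checkpoints, file paths, or library internals.
    """
    # In a real tool, the refactored tutorial code would run the model here.
    effect_score = 0.0  # placeholder for the model's prediction
    return {"variant": f"{chromosome}:{position}{ref}>{alt}", "effect": effect_score}

if __name__ == "__main__":
    mcp.run()  # exposes the tool over the Model Context Protocol
```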
This leads to a fundamental paradigm shift in how we interact with research. Instead of cloning a repository and fighting with dependencies for hours, a researcher can connect a coding assistant like Claude Code to the generated MCP server and issue high-level commands such as: “Analyze cardiac gene expression data using the AlphaGenome MCP…” In this way, a scientific paper stops being a passive document to read and becomes an active service you can query.
In essence, MCP is an abstraction layer separating scientific methodology from its often messy implementation. The end user doesn’t need to know Python internals, library versions, or file paths. Interaction happens through high-level Tools and Prompts. The generated server behaves like a standard API endpoint — a classic software engineering pattern (creating a REST API for a complex backend) — now applied to the chaotic world of scientific code. MCP is effectively an API specification for a scientific publication, making innovation programmatically accessible and composable. It’s a key step toward building larger, automated platforms for scientific discovery.
At a high level, the architecture of Paper2Agent can be viewed as a carefully orchestrated assembly line operated by a team of specialized AI agents. The entire process is initiated by a single shell script, Paper2Agent.sh, which acts like a factory foreman—directing each agent’s work in the correct order. The workflow is resource-intensive, taking anywhere from 30 minutes to over 3 hours and costing about $15 in API usage with Claude Sonnet 4 for complex repositories.
When the process completes, the project directory contains a structured set of artifacts:
src/: contains the generated MCP server and tools.
<repo_name>-env/: an isolated Python environment with all dependencies.
repo/: the cloned original code repository.
claude_outputs/: JSON files with intermediate processing results.
reports/: summaries and analyses in JSON or Markdown format.
tests/: files related to testing the generated tools.
Stage 1: Environment-manager – building the environment
The first step is absolutely critical: creating an isolated, repeatable environment. This agent analyzes configuration files, prepares the workspace, and installs all dependencies. In doing so, Paper2Agent directly tackles the number-one problem in research-code reproducibility—“dependency hell”. By automating this process, it ensures that all subsequent stages operate on a stable and predictable foundation.
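As a rough illustration of what this stage has to accomplish, here is a minimal sketch using a plain venv plus pip; the real agent infers dependencies from the repository's configuration files, and the paths below are hypothetical:

```python
# A minimal sketch of the Environment-manager's job, assuming a plain
# venv + pip setup; paths and the repository name are illustrative.
import subprocess
import venv
from pathlib import Path

def build_isolated_env(repo_dir: Path, env_dir: Path) -> None:
    # Create an isolated interpreter so tutorial runs can't pollute the host.
    venv.create(env_dir, with_pip=True)
    pip = env_dir / "bin" / "pip"  # POSIX layout; "Scripts" on Windows
    requirements = repo_dir / "requirements.txt"
    if requirements.exists():
        # Install the repository's declared dependencies into the sandbox.
        subprocess.run([str(pip), "install", "-r", str(requirements)], check=True)

build_isolated_env(Path("repo/alphagenome"), Path("alphagenome-env"))
```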
Stage 2: Tutorial-scanner
Once the environment is ready, the Tutorial-scanner takes over. Its task is to search the entire repository, locate valuable tutorials, and distinguish them from other scripts. It functions as an intelligent filter, using semantic understanding to identify the parts of the code that represent educational workflows—ideal candidates for transformation into reusable tools.
Stage 3: Tutorial-tool-extractor-implementor
Here we reach the very core of the system. This agent takes an unstructured Jupyter notebook—the standard medium in scientific work—extracts its core logic, parameterizes hard-coded values (such as file paths), wraps it into a clean function signature, and saves it as a modular .py file. At this point, precisely crafted prompts from the prompts/ directory are likely employed to provide the LLM with exact guidance for performing this complex refactoring task.
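A before/after pair makes the refactoring step easier to picture. Both halves below are invented for illustration; the "before" mimics a typical notebook cell, and the resulting function is hypothetical, not actual Paper2Agent output:

```python
# Illustrative before/after of the extractor's refactoring. Hypothetical code.

# Before: a notebook cell with hard-coded state
#   df = pd.read_csv("/home/alice/project/data/expression.csv")
#   df = df[df["p_value"] < 0.05]
#   df.to_csv("/home/alice/project/results/significant.csv")

# After: a parameterized, reusable tool function
import pandas as pd

def filter_significant_genes(input_csv: str, output_csv: str,
                             p_threshold: float = 0.05) -> pd.DataFrame:
    """Keep rows below the p-value threshold and persist the result."""
    df = pd.read_csv(input_csv)
    significant = df[df["p_value"] < p_threshold]
    significant.to_csv(output_csv, index=False)
    return significant
```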
Stage 4: Test-verifier-improver
In the final stage, the newly generated tool undergoes rigorous testing. The Test-verifier-improver agent runs it using the tutorial’s original data and verifies whether the results—both numerical outputs and visualizations—match the original ones. It operates in a loop of generate → test → diagnose → fix, treating the paper’s reported results as concrete unit tests.
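In pseudocode-like Python, the loop amounts to something like the sketch below; every name here is hypothetical, with the LLM call stubbed out rather than taken from Paper2Agent's internals:

```python
# A sketch of the generate → test → diagnose → fix loop. Hypothetical names;
# the LLM repair step is stubbed out.
def fix_with_llm(source: str, feedback: str) -> str:
    """Stub: in the real system, an LLM rewrites the tool given the feedback."""
    return source

def verify_and_improve(tool_source: str, run_tool, reference_output,
                       max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        try:
            result = run_tool(tool_source)        # run the generated tool
        except Exception as error:                # diagnose the failure
            tool_source = fix_with_llm(tool_source, str(error))
            continue
        if result == reference_output:            # tutorial output as the test
            return tool_source                    # verified: keep this version
        tool_source = fix_with_llm(tool_source, f"output mismatch: {result!r}")
    raise RuntimeError("tool failed to reproduce the reference output")
```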
Of course, there are always corner cases in this approach…
Behind the apparent magic of Paper2Agent lies a series of deliberate engineering choices and elegant design patterns.
Pattern 1: The Orchestrator – Paper2Agent.sh
At the heart of the system is the Paper2Agent.sh script, serving as the main orchestrator. It’s not a simple launch script; it’s a carefully designed state machine that guides the user through a nine-stage transformation process.
Its key feature is idempotency — the ability to resume from the last successful step. The script creates a hidden .pipeline/ directory and places empty marker files inside it (e.g., 01_setup_done, 02_clone_done) after each stage completes successfully, acting as checkpoints. Upon relaunch, the script detects these files and intelligently skips already-completed steps — invaluable for long-running processes that may be interrupted.
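The checkpoint pattern itself is tiny. Rendered in Python for illustration (the real orchestrator is plain shell), it boils down to this:

```python
# The marker-file checkpoint pattern in miniature, shown in Python for
# illustration; the real orchestrator is Paper2Agent.sh using empty files.
from pathlib import Path

PIPELINE_DIR = Path(".pipeline")

def run_stage(marker_name: str, stage) -> None:
    marker = PIPELINE_DIR / marker_name
    if marker.exists():            # stage already completed on a previous run
        return
    stage()                        # do the actual work
    PIPELINE_DIR.mkdir(exist_ok=True)
    marker.touch()                 # record success so a rerun can skip it

run_stage("01_setup_done", lambda: print("setting up project"))
run_stage("02_clone_done", lambda: print("cloning repository"))
```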
The entire pipeline works as follows:
Project setup (01_setup_project.sh) – Creates the main working directory for the project.
Repository cloning (02_clone_repo.sh) – Fetches the source code from the provided GitHub URL.
Folder preparation (03_prepare_folders.sh) – Creates the full folder structure (src, reports, claude_outputs, etc.), setting the stage for subsequent agents.
Adding MCP context (04_add_context7_mcp.sh) – Injects the initial MCP server configuration.
Main Paper2Agent pipeline (steps 5–8) – This is the core, where individual scripts invoke the AI agents’ logic:
Environment setup and tutorial scanning (05_run_step1_setup_env.sh) – Handles two agents: Environment-manager (creating an isolated Python environment) and Tutorial-scanner (detecting Jupyter notebooks to process).
Tutorial execution (05_run_step2_execute_tutorials.sh) – A crucial verification phase where the system runs the original notebooks to collect reference outputs and confirm that the code actually works.
Tool extraction (05_run_step3_extract_tools.sh) – Here the Tutorial-tool-extractor-implementor comes into play. The script sends notebook code to the LLM for refactoring and transformation into reusable functions.
MCP server wrapping (05_run_step4_wrap_mcp.sh) – The final stage, assembling the extracted and verified tools into the resulting _mcp.py server file.
Launching the MCP server (06_launch_mcp.sh) – The final script launches the ready-to-use server for interaction.
As you can see, a complex process is broken down into smaller, manageable, and verifiable stages. Using a simple shell script as the orchestrator — instead of a heavy framework — makes the system portable, easy to debug, and fully transparent.
Pattern 2: System Prompts
Surely no one will be surprised that the heart of the system lies in its system prompts. In Paper2Agent, we have several of them—one for each type of interaction the system performs. Here is an excerpt from the tutorial-execution coordinator prompt:
```markdown
# Tutorial Execution Coordinator

## Role
An orchestrator agent that coordinates tutorial execution by managing the tutorial-executor subagent to generate gold-standard outputs from discovered tutorials. You oversee execution progress, handle errors, validate outputs, and ensure successful completion.

## Core Mission
Transform tutorial materials into executable, validated notebooks with gold-standard outputs for downstream tool extraction by coordinating systematic tutorial execution.

## Subagent Capabilities
- **tutorial-executor**: A comprehensive tutorial execution specialist that handles notebook preparation, environment management, iterative error resolution, and output generation for all tutorials.

## Input Requirements
- `reports/tutorial-scanner-include-in-tools.json`: A list of tutorials requiring execution
- `${github_repo_name}-env`: A pre-configured Python environment for execution
- Repository structure under `repo/${github_repo_name}/`
- `api_key`: Optional API key for tutorials requiring external API access: "${api_key}"

## Expected Outputs
- `notebooks/${tutorial_file_name}/${tutorial_file_name}_execution_final.ipynb`: Final validated notebooks
- `notebooks/${tutorial_file_name}/images/`: Extracted figures and visualizations
- `reports/executed_notebooks.json`: Complete execution summary with GitHub URLs

---

## Execution Coordination

### Phase 1: Pre-Execution Validation

**Input Validation:**
```
Pattern 3: MCP as a Standard API
The generated file src/<repo_name>_mcp.py is the final product and a powerful abstraction pattern. This file likely uses a lightweight framework based on the FastMCP project to expose the generated tools from src/tools/ as callable methods.
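Based on that description, the generated server plausibly has a shape like the one below; the module and function names are hypothetical, and the structure assumes FastMCP rather than reproducing Paper2Agent's actual output:

```python
# A hypothetical shape for the generated src/<repo_name>_mcp.py, assuming
# FastMCP; module and function names are illustrative.
from fastmcp import FastMCP

# Each extracted tutorial tool lives in src/tools/ as a plain function.
from tools.filter_significant_genes import filter_significant_genes

mcp = FastMCP("example-paper")

# Registering the function publishes it as an MCP tool, with a typed
# schema derived from its signature and docstring.
mcp.tool()(filter_significant_genes)

if __name__ == "__main__":
    mcp.run()
```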
Pattern 4: Self-Verifying Generation Loop
Returning to the Test-verifier-improver agent from an engineering pattern perspective, it can be described as a Test-Driven Generation pattern. The test (reproducing the original tutorial result) serves as the ultimate quality gate for generated code. The feedback loop enables the system to self-correct and minimizes the risk of LLM “hallucinations” or subtle implementation bugs.
Ever since using NotebookLM, I’ve found the idea of talking to your own sources brilliant. Paper2Agent adds a twist: thanks to MCP, these papers can talk… to each other—through agents.
Paper2Agent fits into the fast-growing field of AI-driven scientific automation. Alongside concepts like Google’s “AI co-scientist” or multi-agent frameworks such as AutoGen and LangChain, it stands out for its pragmatic focus on the “last mile”: the usability of research code. It standardizes methods from publications as MCP-accessible tools, so other agents can invoke them directly. The result? Faster scientific progress and improved reproducibility. It also has a strong educational aspect - students can interact with methods from a paper instead of merely reading about them.
The process isn’t free: it requires powerful, often proprietary models and significant compute time (around $15 and over three hours for complex repositories), which can be a barrier to entry. Furthermore, the system currently focuses on codebases with clear tutorials (mostly Python and Jupyter Notebooks) and may struggle with papers lacking code or written in other languages, such as R or C++.
While the final output is a clean API, the generation process itself relies on opaque LLM reasoning. Debugging why a tool was generated incorrectly can be extremely difficult. The system’s success is likely strongly dependent on the quality of the papers themselves.
Paper2Agent argues that the life of a scientific publication shouldn’t end at release—it should mark the beginning of its journey as a living, breathing tool for the entire community.
A fully deserved star from me - and a project that will undoubtedly inspire the next generation of tools at the intersection of AI and science.