
Architecture

Who is this for? Contributors who need to understand how Rosetta works before changing it.

When should I read this? After Overview. Before touching MCP tools, CLI publishing, instruction content, or folder structure.

For terminology (workflow, skill, rule, subagent, bootstrap, etc.), see Overview — Key Concepts.


Two Repositories

Rosetta operates across two distinct repository types:

Instructions repository (this repo). Where common instructions are defined: skills, agents, workflows, rules, templates. Published to RAGFlow via the CLI. Maintained by instruction authors.

Target repository (any project). Where Rosetta is applied. The coding agent runs here, receives instructions from Rosetta MCP, and maintains workspace files (docs/CONTEXT.md, agents/IMPLEMENTATION.md, etc.). Maintained by developers using AI coding agents.

The instructions repo defines how agents should behave. The target repo is where agents do the work.


System Overview

┌─────────────────────────────────────────────────────────┐
│              Target Repository + IDE                    │
│  Cursor · Claude Code · VS Code · JetBrains · Codex     │
│  Windsurf · Antigravity · OpenCode                      │
│                         │                               │
│                    MCP Protocol                         │
│             (Streamable HTTP + OAuth)                   │
└────────────────────────┬────────────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │    Rosetta MCP      │
              │   (ims-mcp on PyPI) │
              │                     │
              │  VFS resource paths │
              │  Bundler · Tags     │
              │  Context headers    │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │   RAGFlow (Server)  │
              │  (document engine)  │
              │                     │
              │  parse · chunk      │
              │  embed · retrieve   │
              └──────────▲──────────┘
                         │
              ┌──────────┴──────────┐
              │    Rosetta CLI      │
              │   (tools/ims_cli)   │
              │                     │
              │  publish · parse    │
              │  verify · cleanup   │
              └──────────▲──────────┘
                         │
              ┌──────────┴──────────┐
              │  Instructions Repo  │
              │  /instructions/r2/  │
              │                     │
              │  core/ · <org>/     │
              │  skills · agents    │
              │  workflows · rules  │
              └─────────────────────┘

Instructions flow up: the CLI publishes files into RAGFlow, and Rosetta MCP then serves them to IDEs. Rosetta never sees your source code; it only delivers knowledge and instructions.


Environments


Rosetta MCP

The MCP server is the consulting layer between IDEs and the knowledge base. It does not just proxy requests: it transforms, bundles, and contextualizes instructions so agents know how to do things right. It is published on PyPI as ims-mcp and built on FastMCP v3 (latest stable), with OAuthProxy for authentication and RAGFlow as the document engine backend. It speaks in VFS resource paths, adds context headers describing what information means and how to use it, and controls context size automatically.

Transport options:

Authentication: HTTP uses OAuth 2.1 via OAuthProxy (supports any provider: Keycloak, GitHub, Google, Azure) with cached token introspection. STDIO uses ROSETTA_API_KEY. Authorization is policy-based: aia-* datasets are read-only, project-* datasets are configurable.

VFS and Tags

MCP addresses everything by VFS (virtual file system) resource path. The CLI strips the core/ and grid/ prefixes during publishing, so core/skills/planning/SKILL.md and grid/skills/planning/SKILL.md both become skills/planning/SKILL.md. Files at the same resource path get bundled together.
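The prefix stripping described above can be sketched in a few lines. This is a minimal illustration, not the CLI's actual code: the function names and the LAYER_PREFIXES tuple are hypothetical, and only the core/ and grid/ layers named in the text are assumed.

```python
from pathlib import PurePosixPath

# Assumption: the layer folders are exactly the two named in the text.
LAYER_PREFIXES = ("core", "grid")


def to_resource_path(repo_relative_path: str) -> str:
    """Drop the leading layer folder so overlays share one VFS resource path."""
    parts = PurePosixPath(repo_relative_path).parts
    if parts and parts[0] in LAYER_PREFIXES:
        parts = parts[1:]
    return "/".join(parts)


def group_by_resource_path(paths):
    """Collect source files that land on the same VFS resource path."""
    groups = {}
    for path in paths:
        groups.setdefault(to_resource_path(path), []).append(path)
    return groups


groups = group_by_resource_path([
    "core/skills/planning/SKILL.md",
    "grid/skills/planning/SKILL.md",
    "core/rules/guardrails.md",
])
```

Here the two SKILL.md files end up in one group keyed by skills/planning/SKILL.md, which is the set of documents the Bundler would later merge.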

Tags are the primary access mechanism. ACQUIRE <path> FROM KB queries by tags, which provides the most direct and fastest access. The CLI’s auto-tagging was designed specifically for this: every folder name, filename, and composite pair/triple becomes a tag, so agents can request exactly what they need. Keyword search via SEARCH is the fallback for discovery.
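As a rough illustration of the auto-tagging idea, the sketch below derives tags from a resource path. It assumes that "composite pair/triple" means adjacent path segments joined with a slash; the real CLI may build its tags differently, and auto_tags is a hypothetical name.

```python
def auto_tags(resource_path: str) -> list:
    """Return single segments, then adjacent pairs, then adjacent triples."""
    parts = [p for p in resource_path.split("/") if p]
    tags = []
    for size in (1, 2, 3):
        for start in range(len(parts) - size + 1):
            tag = "/".join(parts[start:start + size])
            if tag not in tags:  # keep first occurrence, preserve order
                tags.append(tag)
    return tags


tags = auto_tags("skills/planning/SKILL.md")
```

Under this assumption an agent can ask for SKILL.md, planning/SKILL.md, or skills/planning/SKILL.md and match the same document, with the longer forms being more specific.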

MCP Tools

Eight tools and one resource exposed to agents:

| Tool | Purpose |
| --- | --- |
| get_context_instructions | Bootstrap: load all rules and guardrails bundled (prep step 1) |
| query_instructions | Fetch instruction docs by tags (primary) or keyword search (fallback) |
| list_instructions | Browse the VFS hierarchy (flat listing of immediate children) |
| query_project_context | Search project-specific docs in a target repo dataset |
| store_project_context | Create or update a document in a project dataset |
| discover_projects | List readable project datasets |
| plan_manager | Manage execution plans with phases, steps, dependencies, status. Has a help command for plan creators (subagents don’t need it). Stores the plan in Redis. |
| submit_feedback | Auto-submit structured feedback on agent sessions |

Resource: rosetta://{path} reads bundled instruction documents by VFS resource path.

Bundler

The Bundler merges multiple documents at the same VFS resource path into a single XML response. When an agent ACQUIREs a skill, core and organization files at that path are concatenated into one payload:

<rosetta:file id="..." dataset="..." path="skills/planning/SKILL.md" name="..." tags="..." frontmatter="...">
  [document content from core]
</rosetta:file>
<rosetta:file id="..." dataset="..." path="skills/planning/SKILL.md" name="..." tags="..." frontmatter="...">
  [document content from organization overlay]
</rosetta:file>

Documents are sorted by sort_order (default: 1000000), then by name. INSTRUCTION_ROOT_FILTER controls which layers are included (e.g., CORE,GRID).
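The ordering and concatenation can be sketched as follows. This is a simplified model under stated assumptions: the Doc class and bundle function are hypothetical, only the path and name attributes are emitted, and the document body is inserted without escaping.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import quoteattr

DEFAULT_SORT_ORDER = 1_000_000  # default stated in the text


@dataclass
class Doc:
    name: str
    path: str
    content: str
    sort_order: Optional[int] = None


def bundle(docs):
    """Sort by (sort_order, name) and wrap each document in a rosetta:file element."""
    def key(doc):
        order = DEFAULT_SORT_ORDER if doc.sort_order is None else doc.sort_order
        return (order, doc.name)

    chunks = []
    for doc in sorted(docs, key=key):
        chunks.append(
            f"<rosetta:file path={quoteattr(doc.path)} name={quoteattr(doc.name)}>\n"
            f"{doc.content}\n"
            "</rosetta:file>"
        )
    return "\n".join(chunks)


path = "skills/planning/SKILL.md"
payload = bundle([
    Doc("b-overlay", path, "OVERLAY BODY"),
    Doc("a-core", path, "CORE BODY", sort_order=10),
    Doc("a-extra", path, "EXTRA BODY"),
])
```

A document with an explicit low sort_order comes first; documents left at the default fall back to name order.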

Listing

Listing shows what exists in the VFS without loading content. Implemented by list_instructions to browse the instruction hierarchy. Two formats:

XML format (default) includes metadata attributes:

<rosetta:folder dataset="..." path="skills/" />
<rosetta:folder dataset="..." path="rules/" />
<rosetta:file id="..." path="skills/planning/SKILL.md" name="..." tags="..." frontmatter="..." />

Flat format returns resource paths only:

skills/planning/SKILL.md
skills/coding/SKILL.md
rules/guardrails.md

A full instruction suite listing is ~400 tokens. Frontmatter attributes (extracted by CLI during publishing) let agents understand document purpose from the listing alone, without follow-up reads.
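To illustrate a flat listing of immediate children, here is a minimal sketch over an in-memory list of resource paths. The function name and the convention of marking folders with a trailing slash are assumptions; the actual list_instructions implementation is not shown in this document.

```python
def list_children(resource_paths, folder=""):
    """Return the immediate children of a folder as full paths from the root."""
    prefix = folder.strip("/")
    prefix = prefix + "/" if prefix else ""
    children = set()
    for path in resource_paths:
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        head, sep, _ = rest.partition("/")
        # A remaining separator means the child is a folder, not a file.
        children.add(prefix + head + ("/" if sep else ""))
    return sorted(children)


paths = [
    "skills/planning/SKILL.md",
    "skills/coding/SKILL.md",
    "rules/guardrails.md",
]
top_level = list_children(paths)
skills = list_children(paths, "skills")
```

Because only one level is returned per call, an agent can walk the hierarchy step by step without loading any document content.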

Context Overflow Prevention

MCP manages context size through two mechanisms:

Command Aliases

Command aliases are used exclusively for Rosetta MCP resources (instructions, knowledge base, project datasets). Workspace files in the target repository (docs/CONTEXT.md, agents/IMPLEMENTATION.md, etc.) are read directly from the filesystem. This boundary is intentional: when an agent sees ACQUIRE ... FROM KB, it knows it is calling Rosetta MCP; when it reads a file, it knows it is working with target repository files.

Instructions never call MCP tools directly. Rosetta defines command aliases that work across all IDEs and coding agents. This serves three purposes:

| Alias | Maps to |
| --- | --- |
| GET PREP STEPS | get_context_instructions() |
| ACQUIRE <path> FROM KB | query_instructions(tags="<path>") |
| SEARCH <keywords> IN KB | query_instructions(query="<keywords>") |
| LIST <folder> IN KB | list_instructions(full_path_from_root="<folder>") |
| USE SKILL <name> | Load skill (fetches SKILL.md internally) |
| INVOKE SUBAGENT <name> | Call subagent (fetches agents/<name>.md) |
| USE FLOW <name> | Use workflow or command |
| ACQUIRE <file> ABOUT <project> | query_project_context(repository_name, tags) |
| QUERY <keywords> IN <project> | query_project_context(repository_name, query) |
| STORE <file> TO <project> | store_project_context(repository_name, ...) |
| /rosetta | Engage only the Rosetta flow |

ACQUIRE expects a VFS resource path: filename, parent/filename, or grandparent/parent/filename. LIST is preferred over SEARCH when the folder is known.
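In practice the coding agent interprets these aliases itself. Purely as an illustration of the mapping, the sketch below resolves the four knowledge-base aliases to a tool name and keyword arguments. The ALIASES table and resolve function are hypothetical; the tool and parameter names come from the alias mapping above.

```python
import re

# (pattern, tool name); named groups become the tool's keyword arguments.
ALIASES = [
    (r"GET PREP STEPS", "get_context_instructions"),
    (r"ACQUIRE (?P<tags>\S+) FROM KB", "query_instructions"),
    (r"SEARCH (?P<query>.+) IN KB", "query_instructions"),
    (r"LIST (?P<full_path_from_root>\S+) IN KB", "list_instructions"),
]


def resolve(command: str):
    """Map an alias line to (tool_name, kwargs), or raise if it is unknown."""
    text = command.strip()
    for pattern, tool in ALIASES:
        match = re.fullmatch(pattern, text)
        if match:
            return tool, match.groupdict()
    raise ValueError(f"unknown alias: {command!r}")


call = resolve("ACQUIRE skills/planning/SKILL.md FROM KB")
```

Note that ACQUIRE and SEARCH reach the same tool with different arguments, which is why tags are the primary route and keyword search the fallback.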

Bootstrap Flow

One get_context_instructions call returns all bootstrap rules bundled (core policy, execution policy, guardrails, HITL, rosetta files description). Three prep steps guide the agent on what to do next:

1. Agent connects to Rosetta MCP

2. Server + tool instructions enforce: "call get_context_instructions first"

3. Prep Step 1 — get_context_instructions
   └── Returns bundled bootstrap-* rules: core policy, execution policy,
       guardrails, HITL questioning, workspace file definitions

4. Prep Step 2 — Load project context (direct file reads from target repository)
   └── Read CONTEXT.md, ARCHITECTURE.md; grep headers of other workspace files

5. Prep Step 3 — Classify and route
   └── LIST workflows IN KB; ACQUIRE matching workflows
       Agent now has: bootstrap rules + project context + workflow instructions

6. Agent executes the workflow
   ├── Follows phases (Prepare → Research → Plan → Act)
   ├── Uses ACQUIRE/USE SKILL/INVOKE SUBAGENT to load instructions progressively
   ├── Delegates to subagents, uses plan_manager for tracking
   └── Applies guardrails and HITL gates throughout

All three prep steps are mandatory regardless of task size. The agent calls get_context_instructions exactly once per session.

Key environment variables: ROSETTA_SERVER_URL, ROSETTA_API_KEY, INSTRUCTION_ROOT_FILTER, REDIS_URL

For MCP setup across all IDEs, see Get Started.


RAGFlow (Rosetta Server)

RAGFlow is the document storage and retrieval engine. Rosetta uses it for ingestion, parsing, embedding, and search. Not exposed to end users directly.

Deployment: Local via Docker Compose at http://localhost:80 (development) or hosted instance (production).

Processing pipeline: Upload (upsert by deterministic UUID) → Parse (server-side) → Chunk → Embed → Index. Repeated publishes are idempotent.
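Idempotent publishing follows from deriving the document ID from stable inputs. The sketch below shows one way to do that with uuid5. The namespace, the choice of dataset plus original path as the key, and the in-memory store are all assumptions for illustration; the document does not say which inputs the CLI hashes.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "rosetta-example")


def document_id(dataset: str, original_path: str) -> str:
    """Same dataset and path always yield the same UUID."""
    return str(uuid.uuid5(NAMESPACE, f"{dataset}:{original_path}"))


def upsert(store: dict, dataset: str, original_path: str, content: str) -> str:
    """Insert or overwrite by deterministic ID, so repeats never duplicate."""
    doc_id = document_id(dataset, original_path)
    store[doc_id] = content
    return doc_id


store = {}
first = upsert(store, "aia-r2", "core/rules/guardrails.md", "v1")
second = upsert(store, "aia-r2", "core/rules/guardrails.md", "v2")
```

Publishing the same file twice overwrites one entry instead of creating a second one, which is the property that makes repeated publishes safe.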

Datasets:

| Dataset | Purpose |
| --- | --- |
| aia | Base fallback (files without a release) |
| aia-r1 | R1 release (stable) |
| aia-r2 | R2 release (current) |
| project-* | Per-repository collections in target repos (per OAuth policy) |

Instruction dataset names are auto-generated from the template aia-{release}.
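A small sketch of the naming rule, assuming that a file without a release maps to the base aia dataset as the dataset list suggests. The function and constant names are hypothetical.

```python
from typing import Optional

DATASET_TEMPLATE = "aia-{release}"  # template stated in the text
BASE_DATASET = "aia"                # fallback for files without a release


def instruction_dataset(release: Optional[str]) -> str:
    """Return the dataset name for a release, or the base fallback."""
    if not release:
        return BASE_DATASET
    return DATASET_TEMPLATE.format(release=release)


current = instruction_dataset("r2")
fallback = instruction_dataset(None)
```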

All dataset prefixes are internal only: they must never be exposed in MCP responses or accepted as input. This prevents cross-dataset security issues. MCP users must not be aware that these prefixes exist.

Metadata per document: tags, domain, release, content_hash (MD5), resource_path, sort_order, frontmatter, original_path, line_count.

For RAGFlow internals, see Rosetta Server.


Rosetta CLI

The CLI (tools/ims_cli.py) publishes instructions from the instructions repository into RAGFlow. It handles change detection, metadata extraction, frontmatter parsing, and auto-tagging.

Core commands:

| Command | What it does |
| --- | --- |
| publish ../instructions | Publish changed files (incremental, MD5-based) |
| publish ../instructions --force | Republish all files regardless of changes |
| publish ../instructions --dry-run | Preview what would be published |
| parse | Trigger server-side document parsing |
| verify | Test connection and health |
| list-collection --collection aia-r2 | List documents in a dataset |
| cleanup-collection --collection aia-r2 | Delete documents from a dataset |

Critical rule: Always publish the entire /instructions folder. Never publish subfolders or single files; doing so breaks tag extraction.

Change detection: MD5 hash of content. Only modified files are published (~77% time savings). Use --force to bypass.
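The incremental behavior can be sketched as a hash comparison. This is a minimal model, not the CLI's code: files_to_publish and its dictionary inputs are hypothetical, with published_hashes standing in for the content_hash metadata stored per document.

```python
import hashlib


def content_hash(content: bytes) -> str:
    """MD5 hex digest of the file content, used only for change detection."""
    return hashlib.md5(content).hexdigest()


def files_to_publish(local_files: dict, published_hashes: dict, force: bool = False):
    """Return paths whose content differs from what was last published."""
    changed = []
    for path, content in sorted(local_files.items()):
        if force or published_hashes.get(path) != content_hash(content):
            changed.append(path)
    return changed


local = {
    "core/rules/guardrails.md": b"be careful",
    "core/skills/planning/SKILL.md": b"plan first",
}
published = {"core/rules/guardrails.md": content_hash(b"be careful")}
pending = files_to_publish(local, published)
```

A new or modified file is selected, an unchanged one is skipped, and force=True selects everything, mirroring the --force flag.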

Auto-tagging and metadata extraction. The CLI reads each file during publishing and extracts everything MCP needs to serve it efficiently:

Environment: .env.dev (dev RAGFlow) or .env.prod (production). Switch with cp .env.dev .env.

For deployment details, see Deployment.


Instruction Structure

Instructions live in /instructions/r2/ in the instructions repository, using a layered folder structure.

/instructions/r2/
├── core/                  ← OSS foundation (ships with Rosetta)
│   ├── skills/
│   │   └── <name>/
│   │       ├── SKILL.md
│   │       ├── references/
│   │       └── assets/
│   ├── agents/
│   │   └── <name>.md
│   ├── workflows/
│   │   ├── <name>.md
│   │   └── <name>-<phase>.md
│   ├── rules/
│   │   └── <name>.md
│   └── commands/
│
└── <org>/                 ← Organization extensions (e.g., grid/)
    ├── skills/
    ├── agents/
    ├── workflows/
    ├── rules/
    └── commands/

Layered customization. Core provides the universal foundation. Organization folders extend or override it. Files at the same VFS resource path get bundled together by the Bundler. INSTRUCTION_ROOT_FILTER controls which layers are included (e.g., CORE,GRID).

Component relationships. Workflows invoke subagents. Subagents use skills. All reference rules. Templates live inside skills. Guardrails are rules. See Overview — Key Concepts for definitions.

Naming. Lowercase, dash-separated, globally unique filenames. Entry points: SKILL.md for skills, <name>.md for agents, workflows, and rules.


Workspace Files

Rosetta initializes and maintains a standard file structure in target repositories. These files are how the agent tracks project context, implementation state, and execution plans. All are SRP, DRY, MECE, concise, with grep-friendly topical headers.

Project documentation (docs/):

Agent state (agents/):

Execution (plans/):

Other:

Prep step 2 loads CONTEXT.md and ARCHITECTURE.md from the target repository. The agent updates IMPLEMENTATION.md and MEMORY.md as it works. See Installation — Workspace Files Created for the full list of committed and excluded files.

State management and recovery. For medium and large tasks, workflows create plan, spec, and state files in plans/ and agents/. These files persist execution state to disk, so if a failure occurs (context loss, crash, timeout), the agent or a new session can resume from the last recorded state rather than starting over.


Data Flow

Instructions Repo ──► CLI (publish) ──► RAGFlow ──► Rosetta MCP ──► Target Repo + IDE
  1. Publish. CLI reads .md files from instructions repo, extracts tags + frontmatter + metadata, generates deterministic UUID, upserts into dataset
  2. Index. RAGFlow parses, chunks, embeds, indexes for full-text and semantic search
  3. Bootstrap. Agent calls get_context_instructions via MCP (prep step 1), reads workspace files directly from the target repo (step 2), classifies request via MCP (step 3)
  4. Load. Agent uses ACQUIRE/SEARCH/LIST aliases. MCP queries by tags, bundles matching VFS paths into XML with context headers. Progressive disclosure: only what the workflow needs
  5. Execute. Workflow phases (Prepare → Research → Plan → Act), subagent delegation, plan_manager tracking, guardrails and HITL gates

Development

Prerequisites

Publishing Instructions

Publish instructions to the remote IMS server:

cd tools
source venv/bin/activate
cp .env.dev .env
python ims_cli.py verify
python ims_cli.py publish ../instructions

Pipelines

We use .github/workflows pipelines to build and release the MCP PyPI package, the Docker image, the instructions, and the website. They trigger on push to main or on manual dispatch.

Website: builds the Jekyll website from docs/web/, deploys to GitHub Pages.

Plugin distribution. Three packages via marketplace:

| Plugin | Contents, Footprint |
| --- | --- |
| core@rosetta | Full OSS foundation |
| grid@rosetta | Enterprise extensions |
| rosetta@rosetta | Bootstrap rule + MCP definition only (fetches via MCP) |

Plugins point to source folders in the instructions repository. No local file duplication.


Extension Points

Where contributors add or change things:

After adding or changing instructions, publish with the CLI to make them available via MCP. See the Developer Guide — Where to Change What for the validation steps per change type.


Tradeoffs