The Complete Guide

Agentic Skills

What they are, how every major platform handles them, where the gaps are, and why trust scoring changes everything.

Last updated: February 2026 · ~15 min read

What Is an Agentic Skill?

An agentic skill is a modular capability package that extends what an AI coding agent can do. At its simplest, it's a SKILL.md file — a markdown document with YAML frontmatter (name, description, metadata) plus natural-language instructions — that an agent reads and follows when the right context arises.

The format was introduced by Anthropic when they open-sourced the Agent Skills specification on December 18, 2025. Within weeks, it was adopted by 26+ platforms including OpenAI Codex, GitHub Copilot, VS Code, Cursor, Gemini CLI, Windsurf, Manus, and Amp. Simon Willison called it “deliciously tiny” — and that minimalism is exactly why it spread so fast.

A SKILL.md uses progressive disclosure: metadata loads first (~100 tokens), then full instructions (under 5K tokens), then resources (scripts, references, assets) are pulled on demand. This prevents skills from overwhelming the agent's context window.
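The staged loading can be illustrated with a few lines that split a SKILL.md into its metadata and instruction tiers. A minimal sketch, using a hypothetical csv-summarizer skill (the skill name and contents are invented for illustration):

```python
import re

SKILL_MD = """\
---
name: csv-summarizer
description: Summarize CSV files when the user asks for quick data overviews.
---
# Instructions
1. Read the CSV header to learn column names.
2. Report row count, column types, and notable outliers.
"""

def split_skill(text):
    """Split a SKILL.md into (frontmatter, body) so the tiers can load separately."""
    match = re.match(r"---\n(.*?)\n---\n(.*)", text, re.DOTALL)
    if not match:
        raise ValueError("missing YAML frontmatter")
    return match.group(1), match.group(2)

# Stage 1: only the lightweight metadata enters the context window (~100 tokens).
frontmatter, body = split_skill(SKILL_MD)
print(frontmatter)
# Stage 2: the full instructions load only once the skill is actually triggered.
print(body)
```

An agent host can index thousands of skills this way while paying the context cost of only the frontmatter until a skill fires.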

The .skill ZIP package bundles everything together: a SKILL.md file plus optional scripts/, references/, assets/, and templates/ directories. This is the distribution format used by skill marketplaces and team repositories.
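Building such a package is plain ZIP assembly. A sketch using Python's standard zipfile module; the file names and contents are illustrative, not part of any official tooling:

```python
import io
import zipfile

def build_skill_package(skill_md, scripts=None):
    """Bundle a SKILL.md plus optional scripts/ entries into an in-memory .skill ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("SKILL.md", skill_md)
        for name, content in (scripts or {}).items():
            zf.writestr(f"scripts/{name}", content)
    return buf.getvalue()

package = build_skill_package(
    "---\nname: demo\ndescription: example\n---\nInstructions here.",
    scripts={"summarize.py": "print('hello')"},
)
print(len(package), "bytes")
```

The same structure extends to references/, assets/, and templates/ directories by adding more `writestr` calls with those prefixes.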

How the Major Platforms Handle Skills

Every major AI coding platform now supports the Agent Skills format — but each implements it differently and none provides a trust layer.

Claude

Anthropic

Originator of the open Agent Skills standard. Three-level progressive disclosure: metadata (under 100 tokens) loads first, then full instructions (under 5K tokens), then resources on demand. VM-based execution in Claude Code. Pre-built skills for PowerPoint, Excel, Word, PDF.

Strengths
  • + Defined the standard adopted by 26+ platforms
  • + Progressive disclosure minimizes context usage
  • + Pre-built skills for common enterprise tasks
Gaps
  • No trust scoring or quality verification
  • Security warnings say "exercise extreme caution" but provide no mechanism to verify trust
  • No marketplace or discovery beyond manual directory browsing

OpenAI Codex

OpenAI

Built-in $skill-creator generates skills from natural language. Scope levels (REPO, USER, ADMIN, SYSTEM) control where skills apply. Optional agents/openai.yaml for UI metadata and tool dependencies. Supports an allow_implicit_invocation policy toggle.

Strengths
  • + Automatic skill generation from natural language
  • + Granular scope control (REPO to SYSTEM level)
  • + UI-level metadata via openai.yaml for rich display
Gaps
  • No trust scoring
  • No cross-platform portability guarantees
  • Generated skills have no quality verification before use

Manus

manus.im

Secure sandbox VM execution — unlike Claude's local machine execution, Manus isolates skill execution in controlled environments. One-click import from skill directories. "Less structure, more intelligence" philosophy. Skills + MCP as complementary capabilities.

Strengths
  • + Sandboxed VM execution (stronger isolation)
  • + One-click import from skill directories
  • + Skills and MCP tools work as complementary layers
Gaps
  • No trust scoring or quality metrics
  • No persistent learning between sessions
  • No governance layer beyond model alignment

Cursor

Anysphere

Cross-platform directory support reads skills from .cursor/, .claude/, .codex/, and more. GitHub skill installation via URL. /migrate-to-skills converter for existing setups. disable-model-invocation toggle prevents agent-initiated activation.

Strengths
  • + Cross-platform skill directory support
  • + GitHub-based skill installation
  • + Migration tools for existing configurations
Gaps
  • No trust scoring
  • No quality verification before loading
  • No marketplace or curation layer

The Gap Nobody Has Solved

Every platform evaluates skills internally, but none exports a portable trust score. There is no npm audit for agent skills. No Lighthouse score. No quality gate. The ecosystem has ~100,000 published skills and growing — with no immune system.

The ClawHavoc security incident (February 2026) exposed 824 malicious skills on ClawHub alone, with Bitdefender confirming 17% of audited skills contained malware. These weren't edge cases — they were skills with GitHub stars, install counts, and apparent legitimacy.

Hugging Face's upskill research revealed something arguably worse: skills can degrade agent performance. Curated skills improved model output by +16.2 percentage points on average, but self-generated skills decreased performance by 1.3 points. A skill optimized for Claude Opus may actively harm Claude Haiku. Without per-model quality scoring, developers deploy capability-degrading instructions and have no way to know.

Claude's own documentation warns developers to “exercise extreme caution” with untrusted skills — but provides no mechanism to actually verify that trust. The gap between warning and tooling is where risk lives.

| Capability | Claude | Codex | Manus | Cursor | Yawn |
|---|---|---|---|---|---|
| Trust Score | × | × | × | × | ✓ |
| Quality Verification | × | × | × | × | ✓ |
| Safety Scanning | × | × | × | × | ✓ |
| Cross-Model Testing | × | × | × | × | ✓ |
| Governance Gates | × | × | × | × | ✓ |
| Drift Detection | × | × | × | × | ✓ |
| Evidence/Proof Layer | × | × | × | × | ✓ |
| Economic Model | × | × | × | × | ✓ |

The Yawn Skill Trust Score

The Yawn Skill Trust Score evaluates every skill file across 35+ checks organized into four weighted dimensions. The score produces a composite 0–100 value displayed as an F-to-A+ letter grade — universally understood, screenshot-worthy, and competitive.

Functional (25% weight)

Does the skill define what it does, when to trigger, what it needs, and what it produces?

  • Name and description fields
  • Purpose statement
  • Inputs & outputs
  • Activation trigger
  • Code examples
  • Script references
  • Progressive disclosure compliance
  • Imperative instructions

Quality (20% weight)

Is it well-structured, properly formatted, and documented with examples?

  • YAML frontmatter
  • Description within spec limit
  • No XML tags in metadata
  • Heading hierarchy
  • Examples section
  • Step-by-step instructions
  • Content depth

Safety (30% weight)

Does it handle errors, define constraints, avoid dangerous patterns, and respect boundaries?

  • Safety constraints defined
  • Error handling guidance
  • No code injection patterns (eval/exec)
  • No risky external URL fetches
  • No exposed credentials
  • No privilege escalation
  • Scope boundaries
  • Data privacy awareness
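The static side of these checks can be approximated with simple pattern matching. A minimal sketch with illustrative patterns only, not Yawn's actual rule set:

```python
import re

# Illustrative static safety checks; a real scorer uses many more patterns.
SAFETY_CHECKS = {
    "no_code_injection": lambda text: not re.search(r"\b(eval|exec)\s*\(", text),
    "no_exposed_credentials": lambda text: not re.search(
        r"(api[_-]?key|secret|password)\s*[:=]\s*\S+", text, re.IGNORECASE),
    "no_risky_url_fetch": lambda text: not re.search(r"curl\s+http://", text),
}

def safety_score(skill_text):
    """Percentage of safety checks passed, i.e. one dimension score."""
    passed = sum(check(skill_text) for check in SAFETY_CHECKS.values())
    return 100 * passed / len(SAFETY_CHECKS)

print(safety_score("Run eval(user_input) and set api_key = 'sk-123'"))
```

Pattern checks like these are cheap enough to run on every keystroke, which is what makes a browser-side heuristic engine feasible.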

Maintenance (25% weight)

Is it versioned, cross-platform compatible, tested, and attributed?

  • Version tracking
  • Model compatibility
  • Testing/evidence
  • License
  • Author attribution
  • Dependencies documented
  • Cross-platform portability
  • Document structure

How scoring works

Each dimension score is the percentage of checks passed within that dimension. The composite score uses weighted aggregation: Functional × 0.25 + Quality × 0.20 + Safety × 0.30 + Maintenance × 0.25.

Safety floor: If the safety score falls below 30, the composite is capped at D regardless of how well other dimensions score. An unsafe skill that “works” is worse than a safe skill that doesn't yet.
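Under those stated weights, the aggregation is a few lines of code. A sketch; the letter-grade cutoffs are assumed for illustration (the exact bands are not specified here), and the floor is modeled as a cap at the top of the D band:

```python
WEIGHTS = {"functional": 0.25, "quality": 0.20, "safety": 0.30, "maintenance": 0.25}
GRADES = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]  # assumed bands

def composite_score(dims):
    """Weighted aggregate of the four dimension scores (each 0-100)."""
    score = sum(dims[d] * w for d, w in WEIGHTS.items())
    # Safety floor: a safety score below 30 caps the composite in the D band.
    if dims["safety"] < 30:
        score = min(score, 69)
    return score

def letter_grade(score):
    return next(grade for cutoff, grade in GRADES if score >= cutoff)

# Strong everywhere except safety: the floor drags an apparent C up front down to a D.
print(letter_grade(composite_score(
    {"functional": 95, "quality": 90, "safety": 20, "maintenance": 88})))  # prints "D"
```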

The heuristic engine runs entirely in the browser — zero API calls, instant results, no file upload. The optional Deep Scan sends content to an LLM for semantic analysis of instruction clarity, cross-model compatibility, and hidden safety gaps that static checks cannot detect.

The .yawn Holarchy Model

The missing piece in every existing agent skill standard is holonic composition — skills that contain sub-skills, forming a navigable tree. In the .yawn system, the skill holarchy is simultaneously:

  • Composition — which skills are needed to compose a capability
  • Navigation — the URL path maps directly to the tree position
  • Permission — depth in the tree determines autonomy requirements

yawn.ai/yawn/skills
├── core/ — fundamental capabilities (L4 autonomous)
│   ├── llm-invoke/
│   ├── json-parse/
│   └── rate-limit/
├── perception/ — input understanding (L3-4)
│   ├── entity-extract/
│   └── intent-classify/
├── safety/ — guardrails (L1-2 human required)
│   ├── risk-assess/
│   └── kernel-pdp/
└── orchestration/ — coordination (L2-3)
    └── desired-outcome-matcher/

Each skill tracks a loop status: green (loop closed — active, handler, evidence), yellow (partial — exists but incomplete), red (gap — stub, planned, or blocked). Unmatched slots are lacunae — explicitly identified gaps that the system surfaces for resolution.
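One way to model this is a tree of holon nodes whose loop status rolls up into a list of lacunae. A minimal sketch; the class and field names are illustrative, not the .yawn implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SkillHolon:
    """A node in the skill holarchy: the path doubles as URL and tree position."""
    path: str
    status: str = "red"  # green = loop closed, yellow = partial, red = gap
    children: list = field(default_factory=list)

    def lacunae(self):
        """Collect every red (gap) node in this subtree."""
        gaps = [self.path] if self.status == "red" else []
        for child in self.children:
            gaps.extend(child.lacunae())
        return gaps

root = SkillHolon("yawn/skills", "green", [
    SkillHolon("yawn/skills/core/llm-invoke", "green"),
    SkillHolon("yawn/skills/safety/risk-assess", "yellow"),
    SkillHolon("yawn/skills/orchestration/desired-outcome-matcher"),  # stub: red
])
print(root.lacunae())  # surfaces the one explicit gap for resolution
```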

Why This Matters Now

The EU AI Act's high-risk system requirements take full effect August 2, 2026 — demanding documented risk management, technical transparency, human oversight, and conformity assessments. Penalties reach €35M or 7% of global revenue. For enterprises deploying agents with skills in regulated industries, every consumed skill becomes part of the compliance surface.

Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. The “inadequate risk controls” signal is the market opening for trust infrastructure.

The AI agent market reached $5.4 billion in 2024 and is projected to hit $236 billion by 2034 (45.8% CAGR). AI agents could mediate $3–5 trillion in global consumer commerce by 2030 according to McKinsey. The ecosystem has ~100,000 published skills, growing exponentially, with no quality immune system.

The FICO analogy is precise: credit scoring transformed lending from guesswork to data-driven risk assessment. Trust scoring will do the same for agentic AI. The first trust layer to become the default — the way npm audit became default, the way Dependabot became default on GitHub — wins the market.

The window is approximately 12–18 months before the market consolidates. The message that makes non-adoption feel irresponsible is straightforward: “Your AI agents have access to your codebase, your APIs, and your customer data. They're consuming skills published by strangers with no proof of quality. Would you run npm packages without npm audit?”

Vocabulary

New categories require new language. These terms define the trust layer for agent skills.

Skill Trust Score
A composite F-to-A+ grade derived from four weighted dimensions (Functional, Quality, Safety, Maintenance). Computed via 35+ heuristic checks and optional AI deep scan. The number your team optimizes for.
Trust Gate
A CI/CD checkpoint that blocks unscored or low-trust skills from reaching production agents. Analogous to npm audit or SonarQube's Quality Gate — pipeline halts if the skill fails the trust threshold.
Skill Drift
When a skill's effectiveness degrades silently as the underlying models update. A skill tested on Claude 3.5 Sonnet may produce different (worse) results on Sonnet 4 without any change to the skill itself.
Trust Debt
The accumulated risk from unverified skills consuming agent-level permissions. Like technical debt, it compounds — each unscored skill is a liability with access to your codebase, APIs, and customer data.
Scored Skill
A skill with a verified trust score. The new baseline. Anything without a score is unscored — and that is a choice you are making about what gets access to your systems.
Deep Scan
An AI-powered semantic analysis tier that goes beyond heuristic checks. Uses an LLM to evaluate instruction clarity, cross-model compatibility, hidden safety gaps, and architectural quality that static checks cannot detect.
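Of these terms, the Trust Gate is the most mechanical. A minimal sketch of what such a CI checkpoint might look like; the threshold value and skill names are hypothetical:

```python
TRUST_THRESHOLD = 80  # hypothetical minimum composite score for production

def trust_gate(scores):
    """Return a CI exit code: nonzero if any skill falls below the trust threshold."""
    failing = {name: s for name, s in scores.items() if s < TRUST_THRESHOLD}
    for name, s in failing.items():
        print(f"TRUST GATE FAILED: {name} scored {s} (< {TRUST_THRESHOLD})")
    return 1 if failing else 0

# In CI, a nonzero exit code halts the pipeline, like a failed npm audit.
exit_code = trust_gate({"csv-summarizer": 92, "pdf-export": 61})
```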


Score your first skill free

Drop a SKILL.md or .skill file. Get 35+ trust checks across four dimensions. Results in seconds. No signup.