The Complete Guide

Agentic Skills

What they are, how every major platform handles them, where the gaps are, and why trust scoring changes everything.

Last updated: February 2026 · ~15 min read

What Is an Agentic Skill?

An agentic skill is a modular capability package that extends what an AI coding agent can do. At its simplest, it's a SKILL.md file — a markdown document with YAML frontmatter (name, description, metadata) plus natural-language instructions — that an agent reads and follows when the right context arises.

The format was introduced by Anthropic when they open-sourced the Agent Skills specification on December 18, 2025. Within weeks, it was adopted by 26+ platforms including OpenAI Codex, GitHub Copilot, VS Code, Cursor, Gemini CLI, Windsurf, Manus, and Amp. Simon Willison called it “deliciously tiny” — and that minimalism is exactly why it spread so fast.

A SKILL.md uses progressive disclosure: metadata loads first (~100 tokens), then full instructions (under 5K tokens), then resources (scripts, references, assets) are pulled on demand. This prevents skills from overwhelming the agent's context window.
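The staged loading can be illustrated with a few lines that split a SKILL.md into its metadata and instruction tiers. A minimal sketch, using a hypothetical csv-summarizer skill (the skill name and contents are invented for illustration):

```python
import re

SKILL_MD = """\
---
name: csv-summarizer
description: Summarize CSV files when the user asks for quick data overviews.
---
# Instructions
1. Read the CSV header to learn column names.
2. Report row count, column types, and notable outliers.
"""

def split_skill(text):
    """Split a SKILL.md into (frontmatter, body) so the tiers can load separately."""
    match = re.match(r"---\n(.*?)\n---\n(.*)", text, re.DOTALL)
    if not match:
        raise ValueError("missing YAML frontmatter")
    return match.group(1), match.group(2)

# Stage 1: only the lightweight metadata enters the context window (~100 tokens).
frontmatter, body = split_skill(SKILL_MD)
print(frontmatter)
# Stage 2: the full instructions load only once the skill is actually triggered.
print(body)
```

An agent host can index thousands of skills this way while paying the context cost of only the frontmatter until a skill fires.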

The .skill ZIP package bundles everything together: a SKILL.md file plus optional scripts/, references/, assets/, and templates/ directories. This is the distribution format used by skill marketplaces and team repositories.
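Building such a package is plain ZIP assembly. A sketch using Python's standard zipfile module; the file names and contents are illustrative, not part of any official tooling:

```python
import io
import zipfile

def build_skill_package(skill_md, scripts=None):
    """Bundle a SKILL.md plus optional scripts/ entries into an in-memory .skill ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("SKILL.md", skill_md)
        for name, content in (scripts or {}).items():
            zf.writestr(f"scripts/{name}", content)
    return buf.getvalue()

package = build_skill_package(
    "---\nname: demo\ndescription: example\n---\nInstructions here.",
    scripts={"summarize.py": "print('hello')"},
)
print(len(package), "bytes")
```

The same structure extends to references/, assets/, and templates/ directories by adding more `writestr` calls with those prefixes.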

How the Major Platforms Handle Skills

Every major AI coding platform now supports the Agent Skills format — but each implements it differently and none provides a trust layer.

Claude

Anthropic

Originator of the open Agent Skills standard. Three-level progressive disclosure: metadata (under 100 tokens) loads first, then full instructions (under 5K tokens), then resources on demand. VM-based execution in Claude Code. Pre-built skills for PowerPoint, Excel, Word, PDF.

Strengths
  • + Defined the standard adopted by 26+ platforms
  • + Progressive disclosure minimizes context usage
  • + Pre-built skills for common enterprise tasks
Gaps
  • No trust scoring or quality verification
  • Security warnings say "exercise extreme caution" but provide no mechanism to verify trust
  • No marketplace or discovery beyond manual directory browsing

OpenAI Codex

OpenAI

Built-in $skill-creator generates skills from natural language. Scope levels (REPO, USER, ADMIN, SYSTEM) control where skills apply. Optional agents/openai.yaml for UI metadata and tool dependencies. Supports an allow_implicit_invocation policy toggle.

Strengths
  • + Automatic skill generation from natural language
  • + Granular scope control (REPO to SYSTEM level)
  • + UI-level metadata via openai.yaml for rich display
Gaps
  • No trust scoring
  • No cross-platform portability guarantees
  • Generated skills have no quality verification before use

Manus

manus.im

Secure sandbox VM execution — unlike Claude's local machine execution, Manus isolates skill execution in controlled environments. One-click import from skill directories. "Less structure, more intelligence" philosophy. Skills + MCP as complementary capabilities.

Strengths
  • + Sandboxed VM execution (stronger isolation)
  • + One-click import from skill directories
  • + Skills and MCP tools work as complementary layers
Gaps
  • No trust scoring or quality metrics
  • No persistent learning between sessions
  • No governance layer beyond model alignment

Cursor

Anysphere

Cross-platform directory support reads skills from .cursor/, .claude/, .codex/, and more. GitHub skill installation via URL. /migrate-to-skills converter for existing setups. disable-model-invocation toggle prevents agent-initiated activation.

Strengths
  • + Cross-platform skill directory support
  • + GitHub-based skill installation
  • + Migration tools for existing configurations
Gaps
  • No trust scoring
  • No quality verification before loading
  • No marketplace or curation layer

The Gap Nobody Has Solved

Every platform evaluates skills internally, but none exports a portable trust score. There is no npm audit for agent skills. No Lighthouse score. No quality gate. The ecosystem has ~100,000 published skills and growing — with no immune system.

The ClawHavoc security incident (February 2026) exposed 824 malicious skills on ClawHub alone, with Bitdefender confirming 17% of audited skills contained malware. These weren't edge cases — they were skills with GitHub stars, install counts, and apparent legitimacy.

Hugging Face's upskill research revealed something arguably worse: skills can degrade agent performance. Curated skills improved model output by +16.2 percentage points on average, but self-generated skills decreased performance by 1.3 points. A skill optimized for Claude Opus may actively harm Claude Haiku. Without per-model quality scoring, developers deploy capability-degrading instructions and have no way to know.

Claude's own documentation warns developers to “exercise extreme caution” with untrusted skills — but provides no mechanism to actually verify that trust. The gap between warning and tooling is where risk lives.

| Capability | Claude | Codex | Manus | Cursor | Yawn |
|---|---|---|---|---|---|
| Trust Score | × | × | × | × | ✓ |
| Quality Verification | × | × | × | × | ✓ |
| Safety Scanning | × | × | × | × | ✓ |
| Cross-Model Testing | × | × | × | × | ✓ |
| Governance Gates | × | × | × | × | ✓ |
| Drift Detection | × | × | × | × | ✓ |
| Evidence/Proof Layer | × | × | × | × | ✓ |
| Economic Model | × | × | × | × | ✓ |

The Yawn Skill Trust Score

The Yawn Skill Trust Score evaluates every skill file across 35+ checks organized into four weighted dimensions. The score produces a composite 0–100 value displayed as an F-to-A+ letter grade — universally understood, screenshot-worthy, and competitive.

Functional (25% weight)

Does the skill define what it does, when to trigger, what it needs, and what it produces?

  • Name and description fields
  • Purpose statement
  • Inputs & outputs
  • Activation trigger
  • Code examples
  • Script references
  • Progressive disclosure compliance
  • Imperative instructions

Quality (20% weight)

Is it well-structured, properly formatted, and documented with examples?

  • YAML frontmatter
  • Description within spec limit
  • No XML tags in metadata
  • Heading hierarchy
  • Examples section
  • Step-by-step instructions
  • Content depth

Safety (30% weight)

Does it handle errors, define constraints, avoid dangerous patterns, and respect boundaries?

  • Safety constraints defined
  • Error handling guidance
  • No code injection patterns (eval/exec)
  • No risky external URL fetches
  • No exposed credentials
  • No privilege escalation
  • Scope boundaries
  • Data privacy awareness
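The static side of these checks can be approximated with simple pattern matching. A minimal sketch with illustrative patterns only, not Yawn's actual rule set:

```python
import re

# Illustrative static safety checks; a real scorer uses many more patterns.
SAFETY_CHECKS = {
    "no_code_injection": lambda text: not re.search(r"\b(eval|exec)\s*\(", text),
    "no_exposed_credentials": lambda text: not re.search(
        r"(api[_-]?key|secret|password)\s*[:=]\s*\S+", text, re.IGNORECASE),
    "no_risky_url_fetch": lambda text: not re.search(r"curl\s+http://", text),
}

def safety_score(skill_text):
    """Percentage of safety checks passed, i.e. one dimension score."""
    passed = sum(check(skill_text) for check in SAFETY_CHECKS.values())
    return 100 * passed / len(SAFETY_CHECKS)

print(safety_score("Run eval(user_input) and set api_key = 'sk-123'"))
```

Pattern checks like these are cheap enough to run on every keystroke, which is what makes a browser-side heuristic engine feasible.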

Maintenance (25% weight)

Is it versioned, cross-platform compatible, tested, and attributed?

  • Version tracking
  • Model compatibility
  • Testing/evidence
  • License
  • Author attribution
  • Dependencies documented
  • Cross-platform portability
  • Document structure

How scoring works

Each dimension score is the percentage of checks passed within that dimension. The composite score uses weighted aggregation: Functional × 0.25 + Quality × 0.20 + Safety × 0.30 + Maintenance × 0.25.

Safety floor: If the safety score falls below 30, the composite is capped at D regardless of how well other dimensions score. An unsafe skill that “works” is worse than a safe skill that doesn't yet.
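Under those stated weights, the aggregation is a few lines of code. A sketch; the letter-grade cutoffs are assumed for illustration (the exact bands are not specified here), and the floor is modeled as a cap at the top of the D band:

```python
WEIGHTS = {"functional": 0.25, "quality": 0.20, "safety": 0.30, "maintenance": 0.25}
GRADES = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]  # assumed bands

def composite_score(dims):
    """Weighted aggregate of the four dimension scores (each 0-100)."""
    score = sum(dims[d] * w for d, w in WEIGHTS.items())
    # Safety floor: a safety score below 30 caps the composite in the D band.
    if dims["safety"] < 30:
        score = min(score, 69)
    return score

def letter_grade(score):
    return next(grade for cutoff, grade in GRADES if score >= cutoff)

# Strong everywhere except safety: the floor drags an apparent C up front down to a D.
print(letter_grade(composite_score(
    {"functional": 95, "quality": 90, "safety": 20, "maintenance": 88})))  # prints "D"
```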

The heuristic engine runs entirely in the browser — zero API calls, instant results, no file upload. The optional Deep Scan sends content to an LLM for semantic analysis of instruction clarity, cross-model compatibility, and hidden safety gaps that static checks cannot detect.

The .yawn Holarchy Model

The missing piece in every existing agent skill standard is holonic composition — skills that contain sub-skills, forming a navigable tree. In the .yawn system, the skill holarchy is simultaneously:

  • Composition — which skills are needed to compose a capability
  • Navigation — the URL path maps directly to the tree position
  • Permission — depth in the tree determines autonomy requirements

yawn.ai/yawn/skills
├── core/ — fundamental capabilities (L4 autonomous)
│   ├── llm-invoke/
│   ├── json-parse/
│   └── rate-limit/
├── perception/ — input understanding (L3-4)
│   ├── entity-extract/
│   └── intent-classify/
├── safety/ — guardrails (L1-2 human required)
│   ├── risk-assess/
│   └── kernel-pdp/
└── orchestration/ — coordination (L2-3)
    └── desired-outcome-matcher/

Each skill tracks a loop status: green (loop closed — active, handler, evidence), yellow (partial — exists but incomplete), red (gap — stub, planned, or blocked). Unmatched slots are lacunae — explicitly identified gaps that the system surfaces for resolution.
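One way to model this is a tree of holon nodes whose loop status rolls up into a list of lacunae. A minimal sketch; the class and field names are illustrative, not the .yawn implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SkillHolon:
    """A node in the skill holarchy: the path doubles as URL and tree position."""
    path: str
    status: str = "red"  # green = loop closed, yellow = partial, red = gap
    children: list = field(default_factory=list)

    def lacunae(self):
        """Collect every red (gap) node in this subtree."""
        gaps = [self.path] if self.status == "red" else []
        for child in self.children:
            gaps.extend(child.lacunae())
        return gaps

root = SkillHolon("yawn/skills", "green", [
    SkillHolon("yawn/skills/core/llm-invoke", "green"),
    SkillHolon("yawn/skills/safety/risk-assess", "yellow"),
    SkillHolon("yawn/skills/orchestration/desired-outcome-matcher"),  # stub: red
])
print(root.lacunae())  # surfaces the one explicit gap for resolution
```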

Why This Matters Now

The EU AI Act's high-risk system requirements take full effect August 2, 2026 — demanding documented risk management, technical transparency, human oversight, and conformity assessments. Penalties reach €35M or 7% of global revenue. For enterprises deploying agents with skills in regulated industries, every consumed skill becomes part of the compliance surface.

Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. The “inadequate risk controls” signal is the market opening for trust infrastructure.

The AI agent market reached $5.4 billion in 2024 and is projected to hit $236 billion by 2034 (45.8% CAGR). AI agents could mediate $3–5 trillion in global consumer commerce by 2030 according to McKinsey. The ecosystem has ~100,000 published skills, growing exponentially, with no quality immune system.

The FICO analogy is precise: credit scoring transformed lending from guesswork to data-driven risk assessment. Trust scoring will do the same for agentic AI. The first trust layer to become the default — the way npm audit became default, the way Dependabot became default on GitHub — wins the market.

The window is approximately 12–18 months before the market consolidates. The message that makes non-adoption feel irresponsible is straightforward: “Your AI agents have access to your codebase, your APIs, and your customer data. They're consuming skills published by strangers with no proof of quality. Would you run npm packages without npm audit?”

Vocabulary

New categories require new language. These terms define the trust layer for agent skills.

Skill Trust Score
A composite F-to-A+ grade derived from four weighted dimensions (Functional, Quality, Safety, Maintenance). Computed via 35+ heuristic checks and optional AI deep scan. The number your team optimizes for.
Trust Gate
A CI/CD checkpoint that blocks unscored or low-trust skills from reaching production agents. Analogous to npm audit or SonarQube's Quality Gate — pipeline halts if the skill fails the trust threshold.
Skill Drift
When a skill's effectiveness degrades silently as the underlying models update. A skill tested on Claude 3.5 Sonnet may produce different (worse) results on Sonnet 4 without any change to the skill itself.
Trust Debt
The accumulated risk from unverified skills consuming agent-level permissions. Like technical debt, it compounds — each unscored skill is a liability with access to your codebase, APIs, and customer data.
Scored Skill
A skill with a verified trust score. The new baseline. Anything without a score is unscored — and that is a choice you are making about what gets access to your systems.
Deep Scan
An AI-powered semantic analysis tier that goes beyond heuristic checks. Uses an LLM to evaluate instruction clarity, cross-model compatibility, hidden safety gaps, and architectural quality that static checks cannot detect.
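Of these terms, the Trust Gate is the most mechanical. A minimal sketch of what such a CI checkpoint might look like; the threshold value and skill names are hypothetical:

```python
TRUST_THRESHOLD = 80  # hypothetical minimum composite score for production

def trust_gate(scores):
    """Return a CI exit code: nonzero if any skill falls below the trust threshold."""
    failing = {name: s for name, s in scores.items() if s < TRUST_THRESHOLD}
    for name, s in failing.items():
        print(f"TRUST GATE FAILED: {name} scored {s} (< {TRUST_THRESHOLD})")
    return 1 if failing else 0

# In CI, a nonzero exit code halts the pipeline, like a failed npm audit.
exit_code = trust_gate({"csv-summarizer": 92, "pdf-export": 61})
```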


Score your first skill free

Drop a SKILL.md or .skill file. Get 35+ trust checks across four dimensions. Results in seconds. No signup.