The honest data on Claude Code Skills

Your Skills are probably not firing.

Claude Code auto-invokes Skills reliably only when your wording matches a Skill's description. When you describe intent instead, or deep into a long session, activation drops toward zero. Here is the data, and the fix Anthropic itself recommends.

~50 to 55% baseline Skill auto-activation on Sonnet 4.5. A coin flip.


What Anthropic says Skills do

Skills are meant to fire on their own. Anthropic is explicit about it.

“Claude automatically invokes relevant skills based on your task, no manual selection needed.”
Anthropic, Introducing Agent Skills

The mechanism: at startup Claude pre-loads the name and description of every installed Skill into its system prompt, about 100 tokens each, and uses that metadata to decide whether to trigger each one. The full SKILL.md body is not loaded until the Skill is triggered.
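The two-stage loading described above can be sketched in a few lines. This is an illustration only, not Claude Code's implementation: the `Skill` class, both function names, and the two example Skills are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str  # full SKILL.md instructions, loaded only on trigger


def startup_metadata(skills: list[Skill]) -> str:
    """Build the always-loaded part: one name/description line per Skill."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def load_on_trigger(skills: list[Skill], triggered: str) -> str:
    """Return the full body of the one Skill the model chose to trigger."""
    by_name = {s.name: s for s in skills}
    return by_name[triggered].body


# Hypothetical Skills, for illustration only.
skills = [
    Skill("pdf-forms", "Fill and extract PDF form fields.", "...long instructions..."),
    Skill("release-notes", "Draft release notes from merged PRs.", "...long instructions..."),
]

# Only this short text is present before any Skill fires.
prompt_prefix = startup_metadata(skills)
```

The point of the split is cost: the prefix stays small however long each body is, but it also means the description is the only thing the model can judge relevance by.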

The key word is decide. Activation is a model judgment, not a deterministic trigger.

“If Claude thinks the skill is relevant to the current task, it will load the skill.”
Anthropic engineering

It is probabilistic by design.

What actually happens

That probabilistic design has a measurable cost.

It is keyword-dependent. In sandboxed evals, prompts that contain explicit Skill keywords activate the right Skill almost every time, generic phrasing drops to roughly 20 to 40 percent, and indirect or conceptual phrasing falls to near zero. That single finding explains the widely cited “around 20 percent” figure: it is what you get when you describe intent without naming the Skill.

Baseline is a coin flip. In the same evals, baseline auto-activation with no intervention was about 55 percent and 50 percent across two Sonnet 4.5 runs, and basically zero on Haiku 4.5.

Anthropic documents it as a known failure mode. The official Claude Code docs include a dedicated “Skill not triggering” troubleshooting section. The advice: rephrase to match the description, or fall back to invoking it manually with a slash command.

In our own usage the gap is stark. In one real 76-prompt session with no routing layer, Claude used zero Skills across the entire session. The same kind of workload with Flowy active auto-invoked 19. One paired observation from our transcripts, not a controlled trial. Directional, not proof.

Figure	What it measures	Detail	Evidence
~50 to 55%	Baseline auto-activation on Sonnet 4.5	Two sandboxed eval runs, no intervention.	controlled eval, third-party
~0%	Activation on indirect or conceptual phrasing	Describing intent without naming the Skill.	controlled eval, third-party
~20 to 40%	Activation on generic phrasing	Explains the widely cited 'about 20 percent' figure.	controlled eval, third-party
22 of 22	Standard prompts with a forced-evaluation hook	But plateaus at 67 to 75% on hard, indirect prompts. Not a silver bullet.	controlled eval, third-party
0 vs 19	Skills fired in a 76-prompt session: native vs Flowy	One paired observation from our own transcripts. Directional, not proof.	in our usage, directional, single paired anecdote
38% to 100%	Routing adherence once the agent reads the FLOW.md	Terse banner vs banner plus a mandatory read. Our banner pilot, n=8 per arm, single task and model.	first-party pilot, small n, directional
25,000 tokens	Combined Skill budget kept after compaction	First 5,000 tokens of each, most-recent-first. Older skills get dropped.	documented limit, Anthropic

Why it degrades over long sessions

Even when a Skill fires early, its influence fades.

Anthropic's docs spell out the mechanism: after auto-compaction, Claude Code re-attaches only the most recent invocation of each Skill, keeping the first 5,000 tokens of each within a combined 25,000-token budget, filled most-recent-first. Older skills can be dropped entirely after compaction if you have invoked many in one session.
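A small simulation makes the budget concrete. It is a sketch of the rule as described above, not Claude Code's actual code. The function name and the token counts are hypothetical, and it assumes a Skill that does not fit in the remaining budget is dropped rather than truncated further.

```python
PER_SKILL_CAP = 5_000     # tokens kept from each re-attached Skill
COMBINED_BUDGET = 25_000  # total tokens across all re-attached Skills


def reattach_after_compaction(invocations):
    """Simulate which Skills survive compaction.

    invocations: list of (skill_name, token_count), oldest first.
    Returns (kept, dropped): kept maps skill name to tokens retained,
    dropped lists the Skills that did not fit.
    """
    # Keep only the most recent invocation of each Skill.
    latest = {}
    for name, tokens in invocations:
        latest.pop(name, None)  # re-insert so dict order tracks recency
        latest[name] = tokens

    kept, dropped, used = {}, [], 0
    # Fill the budget most-recent-first.
    for name, tokens in reversed(list(latest.items())):
        share = min(tokens, PER_SKILL_CAP)
        if used + share <= COMBINED_BUDGET:
            kept[name] = share
            used += share
        else:
            # Assumption: no partial re-attachment once the budget is spent.
            dropped.append(name)
    return kept, dropped


# Seven large Skills invoked in order, skill-0 first.
history = [(f"skill-{i}", 8_000) for i in range(7)]
kept, dropped = reattach_after_compaction(history)
```

With seven 8,000-token Skills, the five most recent each keep 5,000 tokens and fill the budget, and the two oldest are dropped.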

“If a skill seems to stop influencing behavior after the first response, the content is usually still present and the model is choosing other tools or approaches.”
Claude Code docs, Skills

The complaint that Claude forgets your Skills exist on long sessions is not folklore. It is a documented limit.

The fix, straight from Anthropic

Here is the part most people miss.

Anthropic's own troubleshooting guidance names the durable fix: strengthen the skill's description and instructions, or use hooks to enforce behavior deterministically, and if the skill is large or you invoked several others after it, re-invoke it after compaction to restore the full content.

“Use hooks to enforce behavior deterministically. If the skill is large or you invoked several others after it, re-invoke it after compaction to restore the full content.”
Claude Code docs, Skills (troubleshooting)

Two levers: a hook for deterministic enforcement, and re-invocation after compaction. The community arrived at the same place independently. A forced-evaluation hook pushed activation to a perfect 22 of 22 on standard prompts in controlled evals. It is not a silver bullet: on harder, indirectly phrased prompts the same hooks plateaued around 67 to 75 percent. Hooks fix the “forgot to look” problem. They do not fully fix ambiguous phrasing.
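A forced-evaluation hook can be very small. The sketch below is not the hook used in the cited evals. It assumes a UserPromptSubmit hook receives a JSON payload with a `prompt` field on stdin and that whatever it writes to stdout is added to the model's context, so check the Claude Code hooks reference before relying on either. The reminder wording and function names are hypothetical.

```python
import io
import json

# Hypothetical wording; adapt it to your own Skills.
REMINDER = (
    "Before responding, evaluate every available Skill against this request. "
    "State which ones apply, then invoke any that do before continuing."
)


def build_context(payload: dict) -> str:
    """Return the text to inject alongside the user's prompt."""
    prompt = payload.get("prompt", "")
    if prompt.lstrip().startswith("/"):
        # Slash commands already name what to run; add nothing.
        return ""
    return REMINDER


def main(stdin, stdout) -> None:
    """Read the hook payload from stdin and emit the reminder on stdout."""
    payload = json.load(stdin)
    context = build_context(payload)
    if context:
        stdout.write(context + "\n")


# Demo with in-memory streams; a real hook script would call
# main(sys.stdin, sys.stdout) instead.
demo_out = io.StringIO()
main(io.StringIO(json.dumps({"prompt": "tidy up the changelog"})), demo_out)
```

This addresses only the “forgot to look” problem: the reminder forces an evaluation on every prompt, but the model still has to judge which Skill an ambiguous request maps to.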

Where Flowy fits

Flowy is Anthropic's recommendation, productized, plus the part the docs do not give you: the hand-pick.

The hook, built in. The unit Flowy ships is a Flow: a plugin of hand-picked skills, auto-invoked at the right moment. Installing one installs a UserPromptSubmit hook that re-asserts the routing decision on every prompt, so the right Skill keeps firing instead of going dormant.

Survives compaction. Flowy re-reads its routing document after a compaction event, exactly the re-invoke-after-compaction step Anthropic describes, done for you.
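The re-read step can be pictured as a second hook. This is a generic sketch, not Flowy's implementation. It assumes a SessionStart hook payload carries `hook_event_name` and a `source` field set to `compact` after compaction, and the routing text in the demo is made up.

```python
import tempfile
from pathlib import Path


def context_after_compaction(payload: dict, routing_doc: Path) -> str:
    """Return the routing text to re-inject, or an empty string otherwise."""
    # Assumption: the payload carries `hook_event_name` and `source` fields.
    if payload.get("hook_event_name") != "SessionStart":
        return ""
    if payload.get("source") != "compact":
        return ""  # fresh start or resume: nothing was compacted away
    if not routing_doc.is_file():
        return ""
    return routing_doc.read_text(encoding="utf-8")


# Demo with a throwaway routing document.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "FLOW.md"
    doc.write_text("Route changelog requests to release-notes.\n", encoding="utf-8")
    after_compact = context_after_compaction(
        {"hook_event_name": "SessionStart", "source": "compact"}, doc
    )
    after_startup = context_after_compaction(
        {"hook_event_name": "SessionStart", "source": "startup"}, doc
    )
```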

Hand-picking, not just enforcement. A hook gets a Skill to fire. Flowy's FLOW.md decides which Skill should fire when several overlap, and ships a hand-picked, disambiguated set. That is what ultra-powers is: 40 skills across build, ship, and grow, behind one router.
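To see why explicit precedence matters when Skills overlap, consider a toy router. FLOW.md's format is not described here, so this is not how Flowy routes; the table, trigger phrases, and Skill names are all hypothetical. It only shows a fixed order replacing a model guess.

```python
from typing import Optional

# Hypothetical routing table: ordered, so the first matching rule wins.
ROUTES = [
    ({"changelog", "release notes"}, "release-notes"),
    ({"deploy", "release"}, "deploy-checklist"),
    ({"landing page", "copy"}, "marketing-copy"),
]


def route(prompt: str) -> Optional[str]:
    """Return the first Skill whose trigger phrases appear in the prompt."""
    text = prompt.lower()
    for triggers, skill in ROUTES:
        if any(trigger in text for trigger in triggers):
            return skill
    return None


# "release notes" and "release" both match; table order settles it.
choice = route("Draft release notes for v2")
```

Here two rules match the same prompt, and the table order resolves the tie the same way every time.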

Auto-activation is real, and for keyword-y requests it already works. The gap is everything else: intent without the keyword, and long sessions where Skills quietly drop out. That gap is where a routing layer earns its place.

Side by side

What actually differs.

Scenario	Native (no layer)	Forced-eval hook	Flowy
Keyword prompts	~100%	100%	100%
Generic / intent prompts	~20 to 40%	improved	routed by FLOW.md
Indirect / conceptual	~0%	67 to 75%	67 to 75% plus disambiguation
Survives compaction	no (skills drop)	if re-invoked	yes, auto re-read
Picks the right skill when many overlap	model guess	no	yes, hand-picked FLOW.md

Hook columns reflect controlled evals on the forced-evaluation pattern. Flowy's compaction and disambiguation rows describe its mechanism, not a head-to-head benchmark.

FAQ

Do Claude Code Skills auto-activate?

Yes, by design. Claude pre-loads each Skill's name and description and decides on its own whether to trigger one, with no manual selection. But that decision is a probabilistic model judgment, not a deterministic rule, so it is reliable only when your wording matches the Skill.

How reliable is skill auto-activation?

In sandboxed evals, baseline activation on Sonnet 4.5 was about 50 to 55 percent, dropped to roughly 20 to 40 percent on generic phrasing, and fell to near zero on indirect or conceptual requests. On Haiku 4.5 it was basically zero. Keyword-matching prompts activate almost every time.

Why does Claude forget my Skills on long sessions?

It is a documented limit. After auto-compaction, Claude Code keeps only the most recent invocation of each Skill, the first 5,000 tokens of each within a combined 25,000-token budget, most-recent-first. Skills invoked earlier can be dropped entirely.

How do you make Skills fire reliably?

Anthropic's own docs recommend two levers: use hooks to enforce behavior deterministically, and re-invoke a Skill after compaction to restore its full content. A forced-evaluation hook reached 22 of 22 on standard prompts in controlled evals, though it plateaued at 67 to 75 percent on hard, indirectly phrased prompts.

What is Flowy and how does it help?

Flowy productizes Anthropic's recommendation and adds the hand-pick. Installing a Flow installs a UserPromptSubmit hook that re-asserts the routing decision every prompt and re-reads its routing document after compaction. Its FLOW.md also decides which Skill should fire when several overlap. ultra-powers is one such Flow, 40 hand-picked skills behind one router.

Does a hook guarantee a Skill activates?

No. Hooks fix the 'the model forgot to look' problem, and they push standard-prompt activation to near perfect. But on ambiguous or indirectly phrased requests the same hooks plateau around 67 to 75 percent. Enforcement plus hand-picking closes more of the gap than enforcement alone.

Try it

Three lines in Claude Code.

claude-code

> /plugin marketplace add MaximoCorrea1/ultra-powers
> /plugin install ultra-powers@ultra-powers
> /ultra-powers:ultra-powers

Prefer to start smaller? Browse the hand-picked library.

Honesty note

Community-reported figures (the “20 percent / coin flip”) are corroborated by the controlled evals cited here but are not Anthropic-published numbers. Our 0-vs-19 figure, from a 76-prompt session, is a single paired observation from our own transcripts, framed as directional. The Anthropic quotes are verbatim from official docs, verified in adversarial review.
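For a rough sense of how far small samples like these can be pushed, a Wilson score interval is one standard check. The sketch below is our own arithmetic on the counts quoted on this page, not part of any cited eval, and the choice of the Wilson method is an assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin


# The forced-evaluation hook result on standard prompts.
low, high = wilson_interval(22, 22)
print(f"22 of 22: {low:.2f} to {high:.2f}")
```

Even a perfect 22 of 22 leaves a lower bound of about 0.85, which is why the figures here are framed as directional.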

Sources
