spark-optimization
spark-optimization, by wshobson, is a practical guide to diagnosing slow Apache Spark jobs through partitioning, shuffle, skew, caching, and memory tuning. Use it to install the skill from wshobson/agents, read SKILL.md, and apply evidence-based fixes drawn from Spark UI symptoms, cluster settings, and query patterns.
This skill scores 78/100, which makes it a solid directory listing candidate for users who want Spark performance-tuning guidance an agent can invoke with reasonable confidence. The repository evidence shows real operational content around partitioning, caching, shuffle reduction, memory tuning, and debugging slow jobs, so it should help agents do better than a generic prompt. The main limitation is that it is documentation-only and lacks companion scripts, references, or install/run scaffolding, so users should expect advisory patterns rather than an executable optimization toolkit.
- Strong triggerability: the description and "When to Use" section clearly cover slow jobs, memory tuning, partitioning, skew, and scaling scenarios.
- Substantial workflow content: the long SKILL.md includes quick-start code fences plus concrete optimization topics like shuffle, serialization, memory, and partition strategy.
- Good agent leverage: it packages production Spark optimization patterns into one place, reducing guesswork versus a broad generic performance prompt.
- Documentation-only skill: no scripts, references, assets, or install command are provided, so execution depends on the agent translating guidance into the user's environment.
- Practical validation signals are thin: structural evidence shows limited explicit constraints/practical signals, which may leave edge-case tuning choices less grounded.
Overview of spark-optimization skill
What spark-optimization is for
The spark-optimization skill is a focused guide for diagnosing and improving Apache Spark job performance. It is best used when a pipeline is already working functionally, but runs too slowly, spills heavily, shuffles too much, suffers from skew, or becomes expensive at scale. Instead of giving generic “make Spark faster” advice, it organizes optimization around the execution model: jobs, stages, tasks, partitions, memory, and shuffle behavior.
Who should use this spark-optimization skill
This skill fits data engineers, analytics engineers, platform teams, and agent users who need practical help with Spark SQL or PySpark performance tuning. It is especially useful if you can provide some context about the workload, such as join patterns, data size, partition counts, executor settings, or symptoms from the Spark UI. If you are a beginner, it still helps, but it works best when paired with concrete job details rather than a blank request.
The real job-to-be-done
Most users are not trying to “optimize Spark” in the abstract. They are trying to answer questions like:
- Why is this join stage much slower than the rest?
- Why are executors spilling or dying with OOM?
- Why did this pipeline get slower after data volume increased?
- Should I repartition, cache, broadcast, or change executor settings?
- How do I reduce shuffle cost without changing business logic?
The spark-optimization skill is valuable because it frames those as diagnosable performance problems, not just code-style issues.
What makes this different from a generic prompt
A generic model prompt often lists every common Spark tuning tip at once. The spark-optimization skill is better suited to a narrower workflow: understand execution behavior first, then choose the right lever. Its strongest differentiators are its emphasis on partitioning strategy, shuffle reduction, memory tuning, and the Spark execution model that explains why a change should help.
How to Use spark-optimization skill
Install context for spark-optimization
This skill lives in the wshobson/agents repository under:
plugins/data-engineering/skills/spark-optimization
If your skill runner supports direct GitHub installs, a common pattern is:
npx skills add https://github.com/wshobson/agents --skill spark-optimization
If your environment uses a different installer, add the skill from that repository and select the spark-optimization slug. The repository evidence only shows SKILL.md, so expect the guidance to be document-driven rather than backed by helper scripts or reference files.
Read this file first
Start with:
plugins/data-engineering/skills/spark-optimization/SKILL.md
Because this skill has no visible supporting scripts, rules, or references, most of the value is in reading the main optimization framework carefully. Do not expect automation that inspects your Spark UI or cluster for you; you need to bring the runtime facts.
What input the skill needs to work well
Output quality from spark-optimization depends heavily on the detail you provide. Useful inputs include:
- Spark API used: PySpark DataFrame, Spark SQL, Scala Spark, streaming, batch
- Data size: row counts, file counts, partition sizes, growth pattern
- Cluster shape: executor memory, cores, number of executors, autoscaling behavior
- Symptoms: skewed tasks, long shuffle read, spills, OOMs, stragglers, GC pauses
- Query pattern: joins, aggregations, UDFs, window functions, repartitioning
- Storage format: Parquet, Delta, Iceberg, CSV, JSON
- Spark UI observations: slow stage IDs, shuffle size, task distribution, input size
- Current config: spark.sql.shuffle.partitions, broadcast thresholds, serializer
Without these, the model will fall back to generic suggestions.
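As a rough illustration, the runtime facts above can be bundled into one prompt-ready summary before invoking the skill. This is a hypothetical helper, not part of the skill itself; all field names are made up for the sketch.

```python
# Hypothetical helper: bundle known workload facts into a prompt-ready summary.
# Field names are illustrative, not part of the spark-optimization skill.

def build_job_summary(facts: dict) -> str:
    """Render known workload facts as labeled lines, skipping unknown fields."""
    labels = {
        "api": "Spark API",
        "data_size": "Data size",
        "cluster": "Cluster shape",
        "symptoms": "Symptoms",
        "query_pattern": "Query pattern",
        "storage_format": "Storage format",
        "ui_observations": "Spark UI observations",
        "config": "Current config",
    }
    lines = [f"{label}: {facts[key]}" for key, label in labels.items() if facts.get(key)]
    return "\n".join(lines)

summary = build_job_summary({
    "api": "PySpark DataFrame, batch",
    "data_size": "1.2 TB Parquet fact table, 40 GB history, 20 MB dimension",
    "symptoms": "high shuffle read; a few tasks run 10x longer than others",
    "config": "spark.sql.shuffle.partitions=200",
})
print(summary)
```

Paste the resulting block into your prompt; fields you cannot fill are simply omitted rather than guessed.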
Turn a rough goal into a strong prompt
Weak prompt:
Optimize my Spark job.
Stronger prompt:
Use the spark-optimization skill to diagnose a slow PySpark batch job. It reads 1.2 TB of Parquet, joins a 1.2 TB fact table with a 20 MB dimension table and a 40 GB history table, then aggregates by customer and day. The slowest stage shows high shuffle read and a few tasks run 10x longer than others. Cluster: 20 executors, 8 cores each, 32 GB memory each. Current spark.sql.shuffle.partitions=200. Suggest likely bottlenecks, code-level fixes, and config changes in priority order, with tradeoffs.
That prompt gives the skill enough evidence to reason about broadcast joins, skew, partition sizing, and shuffle pressure.
A practical spark-optimization workflow
A good workflow is:
- Describe the workload and symptom.
- Share the slow stage characteristics from the Spark UI.
- Ask for bottleneck ranking, not just random tuning tips.
- Request changes in layers:
- query/code changes
- partitioning changes
- join strategy changes
- memory/config changes
- Apply the smallest high-confidence fix first.
- Re-run and compare stage-level behavior.
This keeps the spark-optimization guide grounded in measurable outcomes rather than cargo-cult tuning.
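One concrete check in the "partitioning changes" layer is plain arithmetic: target roughly 128 MiB per shuffle partition (a common rule of thumb, not a universal constant) and compare that against the current setting. A minimal sketch:

```python
import math

def target_shuffle_partitions(shuffle_bytes: int,
                              target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Rule-of-thumb partition count: total shuffle data / ~128 MiB per partition."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

# Example: 600 GiB of shuffle read with the default of 200 partitions
# means ~3 GiB per task, which invites spills and stragglers.
shuffle_bytes = 600 * 1024**3
suggested = target_shuffle_partitions(shuffle_bytes)
print(suggested)                        # 4800 partitions at ~128 MiB each
print(shuffle_bytes // 200 // 1024**2)  # ~3072 MiB per task at the default 200
```

The 128 MiB target is an assumption; adaptive query execution (when enabled) can coalesce partitions at runtime, so treat this as a sanity check, not a prescription.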
What the skill is strongest at
The repository text clearly centers on these areas:
- partitioning strategy
- caching choices
- shuffle optimization
- memory tuning
- scaling Spark pipelines
- reducing data skew
In practice, that means the skill is most helpful for batch performance work where stage structure and data movement are the main cost drivers.
What to ask for explicitly
Ask the skill to produce outputs that are easy to act on, such as:
- a bottleneck hypothesis table
- recommended config changes with expected effect
- likely skew points and validation checks
- whether a broadcast join is appropriate
- whether repartition or coalesce is more suitable
- cache or persist recommendations with justification
- anti-patterns in your transformations
This is much better than asking for a generic “best practices list.”
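For the broadcast-join question in particular, the decision often reduces to comparing the smaller table's size against spark.sql.autoBroadcastJoinThreshold (10 MiB by default in stock Spark). A hedged sketch of that check; the size estimate itself still has to come from your metadata or the Spark UI, and the executor-memory margin below is an invented safety check, not a Spark rule:

```python
from typing import Optional

DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's stock default, 10 MiB

def broadcast_candidate(table_bytes: int,
                        threshold_bytes: int = DEFAULT_AUTO_BROADCAST_THRESHOLD,
                        executor_memory_bytes: Optional[int] = None) -> bool:
    """True if the table is small enough to auto-broadcast, with an optional
    sanity check that it fits comfortably inside executor memory."""
    if table_bytes > threshold_bytes:
        return False
    if executor_memory_bytes is not None and table_bytes > executor_memory_bytes // 10:
        return False  # hypothetical 10% safety margin, not a Spark rule
    return True

print(broadcast_candidate(8 * 1024**2))   # True: 8 MiB fits under the default
print(broadcast_candidate(40 * 1024**3))  # False: a 40 GiB table must shuffle
```

Note that a 20 MB dimension table, as in the earlier example, fails the stock threshold; the fix is either raising the threshold or an explicit broadcast hint, which is exactly the tradeoff worth asking the skill about.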
Repository-backed topics to focus on
Based on the visible source, the best areas to invoke in prompts are:
- Spark execution model
- key performance factors
- shuffle minimization
- data skew handling
- serialization choices
- memory pressure reduction
- partition right-sizing
Those are the concepts the skill is explicitly structured around, so you are more likely to get aligned output by naming them.
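As a quick reference, those topics map onto a handful of well-known Spark settings. The values below are stock defaults at the time of writing (except where noted) and can differ across Spark versions and vendors; treat them as a checklist of levers to discuss, not as recommended values.

```properties
# Settings the named topics usually revolve around (stock defaults shown)
spark.sql.shuffle.partitions=200               # shuffle minimization / partition right-sizing
spark.sql.autoBroadcastJoinThreshold=10485760  # join strategy (10 MiB)
spark.sql.adaptive.enabled=true                # AQE: runtime coalescing and skew handling
spark.sql.adaptive.skewJoin.enabled=true       # data skew handling
spark.serializer=org.apache.spark.serializer.KryoSerializer  # serialization choice (NOT the default)
spark.memory.fraction=0.6                      # memory pressure tuning
```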
Practical prompt template for spark-optimization for Performance Optimization
Use a prompt like this:
Use the spark-optimization skill for Performance Optimization. Analyze this Spark job for shuffle, skew, memory, and partition issues. Here is the job summary: [pipeline description]. Here are the slow stages and Spark UI symptoms: [details]. Here is current cluster config: [details]. Here is a simplified code excerpt: [code]. Recommend the top 3 changes by expected impact, explain why each helps in Spark’s execution model, and note any tradeoffs or validation steps.
That phrasing tends to produce more decision-ready output than “improve performance.”
spark-optimization skill FAQ
Is spark-optimization useful without Spark UI data
Yes, but results are less reliable. The skill can still reason from code shape, join sizes, file layout, and cluster settings. However, if you want confident recommendations for skew, shuffle hotspots, or memory pressure, Spark UI evidence materially improves the answer.
Is this spark-optimization skill for beginners
Yes, if you already know basic Spark concepts. The skill explains optimization through jobs, stages, tasks, partitions, and shuffle behavior, which helps intermediate users level up quickly. Absolute beginners may still need a basic Spark primer before the recommendations feel actionable.
When should I not use spark-optimization
Do not use spark-optimization as your main tool if the problem is really:
- incorrect business logic
- bad source data quality
- orchestration failures
- storage-layer outages
- non-Spark bottlenecks outside the engine
It is also a poor fit if you cannot share any runtime facts and only want a generic tuning checklist.
How is this different from normal Spark advice
Ordinary prompts often produce broad advice like “cache more,” “increase partitions,” or “use broadcast joins.” Installing spark-optimization is worthwhile if you want a workflow centered on diagnosing which of those levers matters first and why, using Spark execution concepts rather than disconnected tips.
Does the skill replace benchmarking
No. It improves the quality of your hypotheses and recommended changes, but you still need to test before and after. Spark optimization is highly workload-dependent, and some changes that help one stage can hurt another.
How to Improve spark-optimization skill
Give symptoms, not just goals
The fastest way to improve spark-optimization usage is to replace abstract goals with specific symptoms:
- “stage 14 has 2 tasks running 12 minutes while others finish in 40 seconds”
- “shuffle read is 600 GB after a groupBy”
- “executors spill to disk during sortMergeJoin”
- “job got slower after increasing daily input from 200 GB to 900 GB”
These clues let the skill map directly to skew, partitioning, memory, or shuffle causes.
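Those symptom-to-cause mappings can be made mechanical. A common skew heuristic, for instance, compares the longest task in a stage against the median: a handful of tasks far above the median points at key skew rather than undersized executors. A rough sketch, with the threshold chosen arbitrarily here:

```python
import statistics

def looks_skewed(task_durations_s, ratio_threshold: float = 5.0) -> bool:
    """Flag a stage as skewed when its slowest task runs far longer than the
    median task. The 5x threshold is an arbitrary illustration, not a Spark rule."""
    if len(task_durations_s) < 2:
        return False
    median = statistics.median(task_durations_s)
    return median > 0 and max(task_durations_s) / median >= ratio_threshold

# "2 tasks running 12 minutes while others finish in 40 seconds"
durations = [40] * 98 + [720, 730]
print(looks_skewed(durations))  # True: 730s is ~18x the 40s median
```

Task durations per stage are visible in the Spark UI's stage detail page, so this check costs nothing beyond copying a few numbers.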
Include the code shape that creates the bottleneck
You do not need to paste your whole project. A small excerpt showing:
- joins
- aggregations
- repartition calls
- cache or persist usage
- Python UDFs
- file reads and writes
is usually enough. Spark performance problems are often visible from transformation shape, not just config.
Ask for priority-ranked recommendations
A common failure mode is getting ten suggestions with no order. Ask the skill to rank fixes by:
- expected performance impact
- implementation effort
- operational risk
That forces sharper output and makes adoption easier.
Request validation steps after each recommendation
Strong use of the spark-optimization guide includes verification. Ask the skill how to confirm each fix worked, such as:
- reduced shuffle read/write
- more even task durations
- fewer spills
- lower GC time
- fewer stages
- better input partition balance
This prevents “tuning by folklore.”
Provide constraints so the advice is realistic
Mention constraints like:
- cannot increase cluster size
- cannot rewrite in Scala
- must keep current table format
- latency target is fixed
- broadcast join memory is limited
- only config changes are allowed this sprint
The skill will give much more usable advice when it knows what is off-limits.
Iterate after the first spark-optimization output
After the first pass, come back with:
- what changed
- what improved
- what got worse
- updated Spark UI observations
This is where the skill becomes genuinely useful. Spark optimization is iterative; the best next step often depends on the side effects of the previous change.
Watch for common misfires
The most common low-quality outcomes happen when users:
- ask for optimization with no workload details
- ignore skew and focus only on executor memory
- over-cache large datasets without reuse evidence
- change partition counts blindly
- treat every join as a broadcast candidate
- optimize configs before fixing expensive transformations
If you want better results, make the skill reason from evidence, not from generic preferences.
