spark-optimization
spark-optimization, by wshobson, is a practical guide to diagnosing slow Apache Spark jobs through partitioning, shuffle, skew, caching, and memory tuning. Use it to install the skill from wshobson/agents, read SKILL.md, and apply evidence-based fixes drawn from Spark UI symptoms, cluster settings, and query patterns.
This skill scores 78/100, which makes it a solid directory listing candidate for users who want Spark performance-tuning guidance an agent can invoke with reasonable confidence. The repository evidence shows real operational content around partitioning, caching, shuffle reduction, memory tuning, and debugging slow jobs, so it should help agents do better than a generic prompt. The main limitation is that it is documentation-only and lacks companion scripts, references, or install/run scaffolding, so users should expect advisory patterns rather than an executable optimization toolkit.
- Strong triggerability: the description and "When to Use" section clearly cover slow jobs, memory tuning, partitioning, skew, and scaling scenarios.
- Substantial workflow content: the long SKILL.md includes quick-start code fences plus concrete optimization topics like shuffle, serialization, memory, and partition strategy.
- Good agent leverage: it packages production Spark optimization patterns into one place, reducing guesswork versus a broad generic performance prompt.
- Documentation-only skill: no scripts, references, assets, or install command are provided, so execution depends on the agent translating guidance into the user's environment.
- Practical validation signals are thin: structural evidence shows limited explicit constraints/practical signals, which may leave edge-case tuning choices less grounded.
Overview of spark-optimization skill
What spark-optimization is for
The spark-optimization skill is a focused guide for diagnosing and improving Apache Spark job performance. It is best used when a pipeline is already working functionally, but runs too slowly, spills heavily, shuffles too much, suffers from skew, or becomes expensive at scale. Instead of giving generic “make Spark faster” advice, it organizes optimization around the execution model: jobs, stages, tasks, partitions, memory, and shuffle behavior.
Who should use this spark-optimization skill
This skill fits data engineers, analytics engineers, platform teams, and agent users who need practical help with Spark SQL or PySpark performance tuning. It is especially useful if you can provide some context about the workload, such as join patterns, data size, partition counts, executor settings, or symptoms from the Spark UI. If you are a beginner, it still helps, but it works best when paired with concrete job details rather than a blank request.
The real job-to-be-done
Most users are not trying to “optimize Spark” in the abstract. They are trying to answer questions like:
- Why is this join stage much slower than the rest?
- Why are executors spilling or dying with OOM?
- Why did this pipeline get slower after data volume increased?
- Should I repartition, cache, broadcast, or change executor settings?
- How do I reduce shuffle cost without changing business logic?
The spark-optimization skill is valuable because it frames those as diagnosable performance problems, not just code-style issues.
What makes this different from a generic prompt
A generic model prompt often lists every common Spark tuning tip at once. The spark-optimization skill is better suited to a narrower workflow: understand execution behavior first, then choose the right lever. Its strongest differentiators are its emphasis on partitioning strategy, shuffle reduction, memory tuning, and the Spark execution model that explains why a change should help.
How to Use spark-optimization skill
Install context for spark-optimization
This skill lives in the wshobson/agents repository under:
plugins/data-engineering/skills/spark-optimization
If your skill runner supports direct GitHub installs, a common pattern is:
npx skills add https://github.com/wshobson/agents --skill spark-optimization
If your environment uses a different installer, add the skill from that repository and select the spark-optimization slug. The repository evidence only shows SKILL.md, so expect the guidance to be document-driven rather than backed by helper scripts or reference files.
Read this file first
Start with:
plugins/data-engineering/skills/spark-optimization/SKILL.md
Because this skill has no visible supporting scripts, rules, or references, most of the value is in reading the main optimization framework carefully. Do not expect automation that inspects your Spark UI or cluster for you; you need to bring the runtime facts.
What input the skill needs to work well
Output quality from spark-optimization depends heavily on the detail you provide. Useful inputs include:
- Spark API used: PySpark DataFrame, Spark SQL, Scala Spark, streaming, batch
- Data size: row counts, file counts, partition sizes, growth pattern
- Cluster shape: executor memory, cores, number of executors, autoscaling behavior
- Symptoms: skewed tasks, long shuffle read, spills, OOMs, stragglers, GC pauses
- Query pattern: joins, aggregations, UDFs, window functions, repartitioning
- Storage format: Parquet, Delta, Iceberg, CSV, JSON
- Spark UI observations: slow stage IDs, shuffle size, task distribution, input size
- Current config: spark.sql.shuffle.partitions, broadcast thresholds, serializer
Without these, the model will fall back to generic suggestions.
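As a rough illustration, the runtime facts above can be bundled into one prompt-ready summary before invoking the skill. This is a hypothetical helper, not part of the skill itself; all field names are made up for the sketch.

```python
# Hypothetical helper: bundle known workload facts into a prompt-ready summary.
# Field names are illustrative, not part of the spark-optimization skill.

def build_job_summary(facts: dict) -> str:
    """Render known workload facts as labeled lines, skipping unknown fields."""
    labels = {
        "api": "Spark API",
        "data_size": "Data size",
        "cluster": "Cluster shape",
        "symptoms": "Symptoms",
        "query_pattern": "Query pattern",
        "storage_format": "Storage format",
        "ui_observations": "Spark UI observations",
        "config": "Current config",
    }
    lines = [f"{label}: {facts[key]}" for key, label in labels.items() if facts.get(key)]
    return "\n".join(lines)

summary = build_job_summary({
    "api": "PySpark DataFrame, batch",
    "data_size": "1.2 TB Parquet fact table, 40 GB history, 20 MB dimension",
    "symptoms": "high shuffle read; a few tasks run 10x longer than others",
    "config": "spark.sql.shuffle.partitions=200",
})
print(summary)
```

Paste the resulting block into your prompt; fields you cannot fill are simply omitted rather than guessed.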
Turn a rough goal into a strong prompt
Weak prompt:
Optimize my Spark job.
Stronger prompt:
Use the spark-optimization skill to diagnose a slow PySpark batch job. It reads 1.2 TB of Parquet, joins a 1.2 TB fact table with a 20 MB dimension table and a 40 GB history table, then aggregates by customer and day. The slowest stage shows high shuffle read and a few tasks run 10x longer than others. Cluster: 20 executors, 8 cores each, 32 GB memory each. Current spark.sql.shuffle.partitions=200. Suggest likely bottlenecks, code-level fixes, and config changes in priority order, with tradeoffs.
That prompt gives the skill enough evidence to reason about broadcast joins, skew, partition sizing, and shuffle pressure.
A practical spark-optimization workflow
A good workflow is:
- Describe the workload and symptom.
- Share the slow stage characteristics from the Spark UI.
- Ask for bottleneck ranking, not just random tuning tips.
- Request changes in layers:
- query/code changes
- partitioning changes
- join strategy changes
- memory/config changes
- Apply the smallest high-confidence fix first.
- Re-run and compare stage-level behavior.
This keeps the spark-optimization guide grounded in measurable outcomes rather than cargo-cult tuning.
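One concrete check in the "partitioning changes" layer is plain arithmetic: target roughly 128 MiB per shuffle partition (a common rule of thumb, not a universal constant) and compare that against the current setting. A minimal sketch:

```python
import math

def target_shuffle_partitions(shuffle_bytes: int,
                              target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Rule-of-thumb partition count: total shuffle data / ~128 MiB per partition."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

# Example: 600 GiB of shuffle read with the default of 200 partitions
# means ~3 GiB per task, which invites spills and stragglers.
shuffle_bytes = 600 * 1024**3
suggested = target_shuffle_partitions(shuffle_bytes)
print(suggested)                        # 4800 partitions at ~128 MiB each
print(shuffle_bytes // 200 // 1024**2)  # ~3072 MiB per task at the default 200
```

The 128 MiB target is an assumption; adaptive query execution (when enabled) can coalesce partitions at runtime, so treat this as a sanity check, not a prescription.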
What the skill is strongest at
The repository text clearly centers on these areas:
- partitioning strategy
- caching choices
- shuffle optimization
- memory tuning
- scaling Spark pipelines
- reducing data skew
In practice, that means the skill is most helpful for batch performance work where stage structure and data movement are the main cost drivers.
What to ask for explicitly
Ask the skill to produce outputs that are easy to act on, such as:
- a bottleneck hypothesis table
- recommended config changes with expected effect
- likely skew points and validation checks
- whether a broadcast join is appropriate
- whether repartition or coalesce is more suitable
- cache or persist recommendations with justification
- anti-patterns in your transformations
This is much better than asking for a generic “best practices list.”
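For the broadcast-join question in particular, the decision often reduces to comparing the smaller table's size against spark.sql.autoBroadcastJoinThreshold (10 MiB by default in stock Spark). A hedged sketch of that check; the size estimate itself still has to come from your metadata or the Spark UI, and the executor-memory margin below is an invented safety check, not a Spark rule:

```python
from typing import Optional

DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's stock default, 10 MiB

def broadcast_candidate(table_bytes: int,
                        threshold_bytes: int = DEFAULT_AUTO_BROADCAST_THRESHOLD,
                        executor_memory_bytes: Optional[int] = None) -> bool:
    """True if the table is small enough to auto-broadcast, with an optional
    sanity check that it fits comfortably inside executor memory."""
    if table_bytes > threshold_bytes:
        return False
    if executor_memory_bytes is not None and table_bytes > executor_memory_bytes // 10:
        return False  # hypothetical 10% safety margin, not a Spark rule
    return True

print(broadcast_candidate(8 * 1024**2))   # True: 8 MiB fits under the default
print(broadcast_candidate(40 * 1024**3))  # False: a 40 GiB table must shuffle
```

Note that a 20 MB dimension table, as in the earlier example, fails the stock threshold; the fix is either raising the threshold or an explicit broadcast hint, which is exactly the tradeoff worth asking the skill about.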
Repository-backed topics to focus on
Based on the visible source, the best areas to invoke in prompts are:
- Spark execution model
- key performance factors
- shuffle minimization
- data skew handling
- serialization choices
- memory pressure reduction
- partition right-sizing
Those are the concepts the skill is explicitly structured around, so you are more likely to get aligned output by naming them.
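As a quick reference, those topics map onto a handful of well-known Spark settings. The values below are stock defaults at the time of writing (except where noted) and can differ across Spark versions and vendors; treat them as a checklist of levers to discuss, not as recommended values.

```properties
# Settings the named topics usually revolve around (stock defaults shown)
spark.sql.shuffle.partitions=200               # shuffle minimization / partition right-sizing
spark.sql.autoBroadcastJoinThreshold=10485760  # join strategy (10 MiB)
spark.sql.adaptive.enabled=true                # AQE: runtime coalescing and skew handling
spark.sql.adaptive.skewJoin.enabled=true       # data skew handling
spark.serializer=org.apache.spark.serializer.KryoSerializer  # serialization choice (NOT the default)
spark.memory.fraction=0.6                      # memory pressure tuning
```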
Practical prompt template for spark-optimization for Performance Optimization
Use a prompt like this:
Use the spark-optimization skill for Performance Optimization. Analyze this Spark job for shuffle, skew, memory, and partition issues. Here is the job summary: [pipeline description]. Here are the slow stages and Spark UI symptoms: [details]. Here is current cluster config: [details]. Here is a simplified code excerpt: [code]. Recommend the top 3 changes by expected impact, explain why each helps in Spark’s execution model, and note any tradeoffs or validation steps.
That phrasing tends to produce more decision-ready output than “improve performance.”
spark-optimization skill FAQ
Is spark-optimization useful without Spark UI data
Yes, but results are less reliable. The skill can still reason from code shape, join sizes, file layout, and cluster settings. However, if you want confident recommendations for skew, shuffle hotspots, or memory pressure, Spark UI evidence materially improves the answer.
Is this spark-optimization skill for beginners
Yes, if you already know basic Spark concepts. The skill explains optimization through jobs, stages, tasks, partitions, and shuffle behavior, which helps intermediate users level up quickly. Absolute beginners may still need a basic Spark primer before the recommendations feel actionable.
When should I not use spark-optimization
Do not use spark-optimization as your main tool if the problem is really:
- incorrect business logic
- bad source data quality
- orchestration failures
- storage-layer outages
- non-Spark bottlenecks outside the engine
It is also a poor fit if you cannot share any runtime facts and only want a generic tuning checklist.
How is this different from normal Spark advice
Ordinary prompts often produce broad advice like “cache more,” “increase partitions,” or “use broadcast joins.” Installing spark-optimization is worthwhile if you want a workflow centered on diagnosing which of those levers matters first and why, using Spark execution concepts rather than disconnected tips.
Does the skill replace benchmarking
No. It improves the quality of your hypotheses and recommended changes, but you still need to test before and after. Spark optimization is highly workload-dependent, and some changes that help one stage can hurt another.
How to Improve spark-optimization skill
Give symptoms, not just goals
The fastest way to improve spark-optimization usage is to replace abstract goals with specific symptoms:
- “stage 14 has 2 tasks running 12 minutes while others finish in 40 seconds”
- “shuffle read is 600 GB after a groupBy”
- “executors spill to disk during sortMergeJoin”
- “job got slower after increasing daily input from 200 GB to 900 GB”
These clues let the skill map directly to skew, partitioning, memory, or shuffle causes.
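Those symptom-to-cause mappings can be made mechanical. A common skew heuristic, for instance, compares the longest task in a stage against the median: a handful of tasks far above the median points at key skew rather than undersized executors. A rough sketch, with the threshold chosen arbitrarily here:

```python
import statistics

def looks_skewed(task_durations_s, ratio_threshold: float = 5.0) -> bool:
    """Flag a stage as skewed when its slowest task runs far longer than the
    median task. The 5x threshold is an arbitrary illustration, not a Spark rule."""
    if len(task_durations_s) < 2:
        return False
    median = statistics.median(task_durations_s)
    return median > 0 and max(task_durations_s) / median >= ratio_threshold

# "2 tasks running 12 minutes while others finish in 40 seconds"
durations = [40] * 98 + [720, 730]
print(looks_skewed(durations))  # True: 730s is ~18x the 40s median
```

Task durations per stage are visible in the Spark UI's stage detail page, so this check costs nothing beyond copying a few numbers.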
Include the code shape that creates the bottleneck
You do not need to paste your whole project. A small excerpt showing:
- joins
- aggregations
- repartition calls
- cache or persist usage
- Python UDFs
- file reads and writes
is usually enough. Spark performance problems are often visible from transformation shape, not just config.
Ask for priority-ranked recommendations
A common failure mode is getting ten suggestions with no order. Ask the skill to rank fixes by:
- expected performance impact
- implementation effort
- operational risk
That forces sharper output and makes adoption easier.
Request validation steps after each recommendation
Strong use of the spark-optimization guide includes verification. Ask the skill how to confirm each fix worked, such as:
- reduced shuffle read/write
- more even task durations
- fewer spills
- lower GC time
- fewer stages
- better input partition balance
This prevents “tuning by folklore.”
Provide constraints so the advice is realistic
Mention constraints like:
- cannot increase cluster size
- cannot rewrite in Scala
- must keep current table format
- latency target is fixed
- broadcast join memory is limited
- only config changes are allowed this sprint
The skill will give much more usable advice when it knows what is off-limits.
Iterate after the first spark-optimization output
After the first pass, come back with:
- what changed
- what improved
- what got worse
- updated Spark UI observations
This is where the skill becomes genuinely useful. Spark optimization is iterative; the best next step often depends on the side effects of the previous change.
Watch for common misfires
The most common low-quality outcomes happen when users:
- ask for optimization with no workload details
- ignore skew and focus only on executor memory
- over-cache large datasets without reuse evidence
- change partition counts blindly
- treat every join as a broadcast candidate
- optimize configs before fixing expensive transformations
If you want better results, make the skill reason from evidence, not from generic preferences.
