Profiling¶
Nanosecond-precision profiler built into the native Zig extension. Zero overhead when disabled, sub-microsecond timing resolution when active.
Overview¶
HyperDjango's profiling system operates at three levels:
- Handler profiling --
@profile_handlerdecorator for individual route timing - Performance middleware --
PerformanceMiddlewarefor per-request query tracking, N+1 detection, and slow query logging - Manual profiling --
ProfilerandProfileStorefor custom instrumentation and flame graph generation
All timing uses the native Zig std.time.nanoTimestamp() function, which reads the hardware timestamp counter directly. This provides ~45 ns resolution with zero system call overhead, compared to Python's time.monotonic_ns() which goes through the kernel.
Nanosecond Timer¶
The profiler uses two native functions exposed from the Zig extension:
_profiler_nanos()-- returns the current monotonic nanosecond timestamp_profiler_diff_nanos(start)-- returns elapsed nanoseconds sincestart
These are wrapped in Python helpers:
from hyperdjango.profiling import nanos, elapsed_nanos
start = nanos()
# ... do work ...
duration = elapsed_nanos(start)
print(f"Took {duration} ns")
The timer resolution is approximately 45 ns on modern hardware. This means you can meaningfully profile operations down to the microsecond level -- useful for measuring individual SQL queries, template renders, and serialization passes.
@profile_handler¶
Profile individual route handlers by adding the @profile_handler decorator:
from hyperdjango.profiling import profile_handler
@app.get("/api/users")
@profile_handler
async def list_users(request):
users = await User.objects.all()
return {"users": [u.to_dict() for u in users]}
When a profiled handler executes, the decorator:
- Creates a
RequestProfilewith the request method and path - Records the handler start time in nanoseconds
- Executes the handler
- Records handler duration
- Attaches the profile as an
X-Profileheader on the response
The X-Profile header contains a human-readable timing breakdown:
The timing categories are:
| Category | What it measures |
|---|---|
total |
End-to-end request time |
handler |
Time spent in the route handler function |
sql |
Total database query time (with query count) |
middleware |
Time spent in middleware processing |
routing |
Time spent resolving the route |
X-Profile Header¶
The X-Profile header is added automatically by @profile_handler. You can also enable it per-request by passing ?profile=1 as a query parameter:
GET /api/users?profile=1 HTTP/1.1
HTTP/1.1 200 OK
X-Profile: total=1.2ms handler=0.8ms sql=0.3ms(2q) middleware=0.1ms routing=45ns
Content-Type: application/json
To get the full profile as JSON instead of the header summary, use ?profile=json:
profile = get_current_profile()
if profile:
data = profile.to_dict()
# Returns:
# {
# "method": "GET",
# "path": "/api/users",
# "total_ns": 1200000,
# "total": "1.2ms",
# "middleware_ns": 100000,
# "routing_ns": 45,
# "handler_ns": 800000,
# "sql_total_ns": 300000,
# "sql_count": 2,
# "queries": [
# {"sql": "SELECT * FROM users...", "duration_ns": 150000, "duration": "150.0us"},
# {"sql": "SELECT * FROM profiles...", "duration_ns": 150000, "duration": "150.0us"},
# ]
# }
RequestProfile¶
Every profiled request produces a RequestProfile dataclass:
from hyperdjango.profiling import RequestProfile, SQLQuery
@dataclass(slots=True)
class RequestProfile:
method: str = ""
path: str = ""
start_ns: int = 0
total_ns: int = 0
middleware_ns: int = 0
routing_ns: int = 0
handler_ns: int = 0
sql_total_ns: int = 0
sql_queries: list[SQLQuery] = field(default_factory=list)
Each SQL query is recorded as an SQLQuery:
Key methods on RequestProfile:
| Method | Returns | Description |
|---|---|---|
to_header() |
str |
Format as X-Profile header value |
to_dict() |
dict |
Full profile as JSON-serializable dict |
to_collapsed_stack() |
str |
Collapsed stack format for flame graphs |
sql_count |
int |
Number of SQL queries (property) |
Query Tracking¶
SQL queries are recorded into the current profile automatically when the database layer is instrumented. You can also record queries manually:
from hyperdjango.profiling import record_sql, nanos, elapsed_nanos
start = nanos()
result = await db.query("SELECT * FROM users WHERE active = $1", [True])
duration = elapsed_nanos(start)
record_sql("SELECT * FROM users WHERE active = $1", duration, params=(True,))
Per-request query statistics are available via the RequestProfile:
profile = end_profile()
print(f"Queries: {profile.sql_count}")
print(f"Total SQL time: {profile.sql_total_ns / 1_000_000:.1f}ms")
for q in profile.sql_queries:
print(f" {q.duration_ns / 1000:.0f}us {q.sql[:80]}")
Performance Middleware¶
The PerformanceMiddleware provides automatic per-request monitoring without decorating individual handlers:
from hyperdjango.performance import PerformanceMiddleware
perf = PerformanceMiddleware(
slow_query_threshold_ms=100.0, # Log queries slower than 100ms
n_plus_one_threshold=5, # Flag SQL patterns repeated 5+ times
max_history=1000, # Keep last 1000 request stats
dashboard_path="/debug/performance", # Dashboard URL
enabled=True,
)
app.use(perf)
Configuration Options¶
| Parameter | Default | Description |
|---|---|---|
slow_query_threshold_ms |
100.0 |
Queries slower than this are flagged as slow |
n_plus_one_threshold |
5 |
A SQL pattern repeated this many times triggers N+1 warning |
max_history |
1000 |
Ring buffer size for recent request stats |
dashboard_path |
"/debug/performance" |
URL path for the HTML dashboard |
enabled |
True |
Toggle middleware on/off |
Response Headers¶
The middleware adds headers to every response:
N+1 Detection¶
N+1 query detection works by normalizing SQL queries (replacing literals with ? placeholders) and counting how many times each normalized pattern appears in a single request. If any pattern exceeds the n_plus_one_threshold, it is flagged.
For example, if a request produces these queries:
SELECT * FROM articles
SELECT * FROM authors WHERE id = 1
SELECT * FROM authors WHERE id = 2
SELECT * FROM authors WHERE id = 3
SELECT * FROM authors WHERE id = 4
SELECT * FROM authors WHERE id = 5
The middleware normalizes them:
The second pattern appears 5 times, triggering the N+1 alert. The normalized SQL pattern is included in the X-N-Plus-One header and recorded in the request stats.
To fix N+1 queries, use select_related() or prefetch_related():
# N+1: one query per author
articles = await Article.objects.all()
for article in articles:
print(article.author.name) # Each access triggers a query
# Fixed: single JOIN query
articles = await Article.objects.select_related("author").all()
for article in articles:
print(article.author.name) # No additional queries
Performance Dashboard¶
The middleware serves an HTML dashboard at the configured dashboard_path (default: /debug/performance). The dashboard displays:
- Total Requests -- number of requests since server start
- Total Queries -- total SQL queries executed
- Avg Queries/Req -- average queries per request
- Slow Queries -- count of queries exceeding the threshold
- N+1 Patterns -- count of detected N+1 query patterns
- Recent Slow Queries -- table of recent slow queries with SQL and duration
- N+1 Query Patterns -- table of detected N+1 SQL patterns
A JSON API is also available at /debug/performance/json for programmatic access:
# GET /debug/performance/json
{
"total_requests": 1542,
"total_queries": 8734,
"avg_queries_per_request": 5.7,
"slow_query_count": 12,
"n_plus_one_count": 3,
"recent_slow_queries": [
{"sql": "SELECT * FROM ...", "duration_ms": 245.3},
],
"recent_n_plus_one": [
"SELECT * FROM authors WHERE id = ?"
]
}
SlowQueryLog Integration¶
For persistent slow query tracking that survives server restarts, use SlowQueryLog from the pool module:
from hyperdjango.pool import SlowQueryLog, QueryTimer
# Create persistent slow query log (PostgreSQL UNLOGGED table)
slow_log = SlowQueryLog(db, threshold_ms=100)
await slow_log.ensure_table()
# Auto-instrument all database queries
timer = QueryTimer(db, slow_log=slow_log, threshold_ms=100)
timer.install()
The SlowQueryLog writes to an UNLOGGED PostgreSQL table (hyper_slow_queries) with columns:
| Column | Type | Description |
|---|---|---|
id |
SERIAL |
Primary key |
sql_text |
TEXT |
The slow query SQL |
duration_ms |
REAL |
Query duration in milliseconds |
params_summary |
TEXT |
Summary of query parameters |
timestamp |
TIMESTAMPTZ |
When the query was recorded |
Indexes on timestamp DESC and duration_ms DESC enable fast lookups of recent and slowest queries.
Manual Profiling with QueryTimer¶
The QueryTimer wraps database methods with automatic timing:
from hyperdjango.pool import QueryTimer
timer = QueryTimer(db, threshold_ms=50)
timer.install() # Patches db.query and db.execute
# All subsequent queries are automatically timed
result = await db.query("SELECT * FROM large_table")
# If > 50ms, logged to SlowQueryLog
For manual timing of specific operations:
from hyperdjango.profiling import nanos, elapsed_nanos, start_profile, end_profile
# Profile a specific code block
profile = start_profile(method="TASK", path="generate_report")
start = nanos()
data = await compute_report(report_id)
profile.handler_ns = elapsed_nanos(start)
completed = end_profile()
print(completed.to_dict())
ProfileStore and Flame Graphs¶
The global ProfileStore accumulates profiles for analysis:
from hyperdjango.profiling import get_store
store = get_store()
# Get the 10 slowest requests
slowest = store.get_slowest(n=10)
for p in slowest:
print(f"{p.method} {p.path}: {p.total_ns / 1_000_000:.1f}ms ({p.sql_count} queries)")
# Generate collapsed stack format for flame graphs
collapsed = store.get_flame_graph()
The collapsed stack format is compatible with speedscope and Brendan Gregg's FlameGraph tools:
request;middleware 100000
request;routing 45000
request;handler;python 500000
request;handler;sql;SELECT * FROM users WHERE id = ? 150000
request;handler;sql;SELECT * FROM profiles WHERE user_id = ? 150000
To generate an SVG flame graph:
from hyperdjango.profiling import Profiler
profiler = Profiler()
with profiler.trace("my_operation"):
result = await expensive_function()
profiler.dump_flamegraph("profile.svg")
The ProfileStore is a thread-safe ring buffer (default capacity: 1000 profiles) that can be cleared with store.clear().
Production vs Development Profiling¶
Development¶
In development, enable full profiling with handler decorators and the performance dashboard:
from hyperdjango.profiling import profile_handler
from hyperdjango.performance import PerformanceMiddleware
# Profile all routes of interest
@app.get("/api/users")
@profile_handler
async def list_users(request):
...
# Enable query tracking and N+1 detection
app.use(PerformanceMiddleware(
slow_query_threshold_ms=50,
n_plus_one_threshold=3,
dashboard_path="/debug/performance",
))
Production¶
In production, use the persistent SlowQueryLog and disable the dashboard:
from hyperdjango.pool import SlowQueryLog, QueryTimer
from hyperdjango.performance import PerformanceMiddleware
# Persistent slow query log
slow_log = SlowQueryLog(db, threshold_ms=200)
await slow_log.ensure_table()
timer = QueryTimer(db, slow_log=slow_log, threshold_ms=200)
timer.install()
# Lightweight request stats (no dashboard)
app.use(PerformanceMiddleware(
slow_query_threshold_ms=200,
n_plus_one_threshold=10,
dashboard_path=None, # Disable dashboard in prod
enabled=True,
))
Key differences:
| Setting | Development | Production |
|---|---|---|
slow_query_threshold_ms |
50 ms | 200 ms |
n_plus_one_threshold |
3 | 10 |
| Dashboard | Enabled | Disabled |
@profile_handler |
Per-route | Disabled (zero overhead) |
SlowQueryLog |
Optional | Enabled |
X-Profile header |
Shown | Stripped by reverse proxy |
Benchmarking Patterns¶
Timing a Database Query¶
from hyperdjango.profiling import nanos, elapsed_nanos
iterations = 1000
start = nanos()
for _ in range(iterations):
await db.query("SELECT 1")
total = elapsed_nanos(start)
print(f"Avg: {total / iterations / 1000:.1f} us/query")
Comparing ORM vs Raw SQL¶
# ORM path
start = nanos()
users = await User.objects.filter(active=True).all()
orm_time = elapsed_nanos(start)
# Raw SQL path
start = nanos()
rows = await db.query("SELECT * FROM users WHERE active = $1", [True])
raw_time = elapsed_nanos(start)
print(f"ORM: {orm_time / 1_000_000:.1f}ms, Raw: {raw_time / 1_000_000:.1f}ms")
print(f"ORM overhead: {orm_time / raw_time:.1f}x")
Profiling Template Rendering¶
from hyperdjango.profiling import nanos, elapsed_nanos
start = nanos()
html = engine.render("complex_template.html", context)
render_time = elapsed_nanos(start)
print(f"Template render: {render_time / 1000:.0f} us")
Precision¶
- Timing resolution: ~45 ns (hardware timestamp counter via Zig
std.time.nanoTimestamp) - Zero overhead when profiling is disabled (no function wrapping, no thread-local access)
- Thread-safe timing via per-thread counters (
threading.local)
Profile-Driven Optimization Workflow (THE PROCESS)¶
Do not optimize what you think is slow. Measure it first. This section codifies the methodology for finding real throughput improvements that are hiding behind wrong assumptions.
The Rule¶
Every performance change must be preceded by a profile that identifies the hotspot, and followed by a re-profile that proves the fix worked. No "I think this will be faster" commits.
The Stability Rule¶
Any performance claim requires a multi-run median measurement with jitter under 5%. Single-run micro-benchmarks are inadmissible evidence.
Short-run measurements (100K-300K iterations at sub-microsecond per-op costs) have ±30% run-to-run variance on modern CPUs due to:
- CPU frequency scaling (P-cores ramping up/down)
- GC collection pauses
- Scheduler noise from other processes
- Inline cache + branch predictor warmup
Concrete example: scripts/bench_where_compile.py at 100K × 1 run reported
the Zig _where_compile speedup as "1.26-1.28x" in three successive runs.
At 500K × 5 runs (median), the actual speedup is 2.64-3.99x depending
on scenario — a 3x underestimate from bad measurement methodology.
Required benchmark structure¶
Every micro-benchmark script must:
- Total operations ≥ 1 million (or the measurement runs ≥ 1 second total)
- Warmup ≥ 10× the per-op call-site interpreter caches (500-10,000 ops typical)
- Multi-run median (N=3-5) with per-run values printed so jitter is visible
- Jitter percentage computed and reported —
(max - min) / median / 2 * 100 - Refuse to report a "win" if jitter > 5%
Reference implementations:
cProfile (in-process, structural identification):
scripts/profile_hypernews_cprofile.py— 5 hypernews endpoints, per-endpoint iteration counts so each run hits ≥5s at the endpoint's expected rps, ⚠ SHORT RUN warning if any run drops belowscripts/profile_list_cprofile.py— bookstore List endpoint, 10K iters × 3 runs since the endpoint is faster than hypernewsscripts/profile_admin_cprofile.py— HyperAdmin dashboard / changelist / add form / login page; logs in via/admin/login/then profiles each
wrk (wire-speed verification):
scripts/bench_hypernews_wrk.py— 5 hypernews endpoints, 4t/20c/8s × 3 runs, multi-run median, per-run jitter, drop+seed integration,HYPER_POOL_SIZEenv override forwarded to AppRunnerscripts/bench_bookstore_wrk.py— 6 bookstore REST endpoints, same structure as hypernews wrk
Microbench (single-function isolation):
scripts/bench_where_compile.py— 500K × 5 runs, jitter per scenarioscripts/bench_hmac_paths.py— 1M × 5 runs, per-run ns shownscripts/bench_from_record_bulk.py— 4-variant Model.from_record dispatchscripts/bench_time_ago.py— time_bucket_cached vs uncached vs lru_cache
Required endpoint profile structure¶
For in-process TestClient + cProfile profiling:
- ≥ 5,000 iterations per run (or total cProfile time ≥ 5 seconds per endpoint)
- Warmup ≥ 500 requests to prime: LRU cache, prepared statement cache, connection pool, template bytecode cache, OS page cache
- 3 runs minimum with
per_run_rps: [a, b, c]in the report - Jitter ≤ 5% required to quote the number; if higher, either increase iterations or investigate what's adding noise (CPU throttling, concurrent work, memory pressure)
What to write in a performance CHANGELOG entry¶
WRONG:
"/login endpoint improved from 3,132 to 5,746 rps (+83.5%)"
This is a single-run peak compared to a single-run baseline. The next run of the same code with no changes might produce 4,091 rps. The "+83.5%" is wishful thinking.
RIGHT:
"/login endpoint improved from 3,132 to 4,863 ± 1.6% rps (median of 3 runs × 5,000 iters) — +55.3%"
The jitter bound and per-run count make the number auditable. Future readers can tell whether a subsequent regression is real or within noise.
Why this rule exists¶
Intuition about bottlenecks is wrong most of the time. A function can look
hot in a micro-benchmark and contribute nothing to production throughput
because it's never in the top 30 by self-time. The only way to know is to
profile the actual endpoint first — cProfile will surface the real hotspot,
which often turns out to be a small Python method (MRO walks, attribute
lookups, repeated dict construction) fixable in a metaclass cache with no
native code.
Profiling is cheap. Always run it first.
The Six-Phase Workflow¶
Phase 0: Preconditions
- Release-mode Zig build active:
uv run hyper-build --release— profiling a Debug build wastes time; tracing overhead dominates the real signal - A stable reproducible workload (wrk hitting a real endpoint, or a scripted TestClient loop)
- The baseline numbers already recorded in a JSON file you can diff against later
Phase 1: Baseline throughput
Run the full production benchmark and save the JSON baseline:
uv run python scripts/bench_load_orm.py
cp logs/load_orm_baseline.json logs/profile_baseline_before.json
Write down the rps and avg_ms for every endpoint. This is your "before" state.
Phase 2: In-process cProfile of the slowest endpoint
Use scripts/profile_list_cprofile.py as a template (or create a sibling for
a different endpoint). The script does:
hyper setup --drop --seedto prepare a clean DB- Import the app and create an in-process
TestClient - Warmup (50 requests) so JIT caches, prepared statement cache, query cache are all populated
cProfile.Profile()over N iterations of the target endpoint- Dump stats sorted by both
tottime(self-time) andcumtime(cumulative) - Write JSON report with top-30 functions for programmatic comparison
Run it:
Read logs/profile_list_cprofile.txt top-15 by self-time. The answer will
almost always surprise you. Key questions:
- What percentage of total time is the top function? (If >10%, it's worth fixing)
- Are any of the top 10 pure Python? (Pure Python is cheap to optimize)
- What is the call count? (High count × moderate per-call cost is the fat middle that micro-benchmarks miss)
- Is your assumed hotspot in the top 30? (If not, do not work on it)
Phase 3: Identify the optimization target
Look for one of these patterns in the top entries:
- High self-time + high call count + pure Python → cacheable computation (e.g., 87ms / 50,000 calls / pure MRO walk → dict cache)
- High self-time + native function → irreducible floor, only beatable by changing the call pattern (fewer queries, batching, etc.)
- High self-time + I/O (kqueue, epoll, socket) → async scheduling issue, look at concurrency patterns
For each candidate, estimate the theoretical ceiling:
- Current time:
X ms total - If this function drops to zero:
X - Y ms = Z ms - Expected throughput improvement:
1/Z - 1/Xas a percentage
If the ceiling is <5%, keep looking. If the ceiling is >20%, you have a target.
Phase 4: Implement the fix
Write the minimum change that addresses the hotspot. Favor:
- Caching on class/module level (one-time cost, amortized across all calls)
- Avoiding allocations in hot loops
- Replacing dynamic attribute access (
getattr,type(x).__dict__.get) with pre-resolved direct references
Do NOT:
- Rewrite in Zig unless you've proven Python allocation or interpreter overhead is the bottleneck AND you've measured that FFI overhead is lower than what you save
- Add complexity (flags, indirection layers) to handle cases the profile didn't show
Phase 5: Re-profile and verify
Re-run Phase 2 (cProfile) and Phase 1 (production benchmark). Both must show improvement. If the cProfile drops but wrk stays flat, you optimized something that doesn't matter at wire speed (e.g., your fix only helps non-DB-bound endpoints, or the connection pool is saturated).
Record the delta as a markdown table comparing before/after rps, avg_ms, and the specific pstats entries you were targeting. Example:
Before: _get_compute_method = 87ms (14% total), avg_ms = 1.27
After: _get_compute_method = 20ms (4% total), avg_ms = 1.02
Wire: List endpoint 4,076 rps → 6,639 rps (+62.9%)
If the numbers don't match the theoretical ceiling, investigate why — either your profile didn't capture something, or your fix has a hidden cost.
Phase 6: Commit the report
Every optimization must include:
- The profiling script (so anyone can reproduce)
- The before/after numbers in
logs/profile_<target>_report.md - A test verifying the functional change is parity-preserving (e.g., existing serializer tests still pass)
- A CHANGELOG entry mentioning the real-world throughput delta
Worked Example: Serializer Compute-Method Cache¶
A canonical reference for "what this workflow produces":
Before:
cProfile top 3 by self-time on /api/v1/books/ (500 iter):
1. _db_query_dicts (Zig native): 111ms
2. _db_query (Zig native): 108ms
3. _get_compute_method: 87ms ← pure Python, 50,000 calls
Root cause: Serializer._get_compute_method walked the MRO (type(self).__mro__)
and called type(self).__dict__.get(name) on every candidate name, for every
field, for every serialized object. For a 10-item list with a 10-field
serializer, that's 100 field resolutions per request × ~16 mappingproxy gets
each = 1,600 MRO walks per request, almost all returning None.
Fix (in serializers.py, SerializerMeta.__new__):
# Pre-compute compute-method lookup table at class creation time.
compute_cache: dict[str, object] = {}
for candidate_name in dir(cls):
if candidate_name.startswith("_"):
continue
attr = cls.__dict__.get(candidate_name)
if attr is None:
for base in cls.__mro__[1:]:
attr = base.__dict__.get(candidate_name)
if attr is not None:
break
if attr is not None and callable(attr):
compute_cache[candidate_name] = attr
cls._compute_method_cache = compute_cache
def _get_compute_method(self, source: str):
cache = type(self)._compute_method_cache
method = cache.get(source)
if method is None:
method = cache.get(f"get_{source}")
if method is None:
return None
return lambda obj, m=method: m(self, obj)
After:
cProfile:
_get_compute_method: 87ms → 20ms (4.35x faster)
_serialize_one total: 1.27ms → 1.02ms avg (20% faster)
Production (wrk 16t/200c/10s):
List: 4,076 → 6,639 rps (+62.9%)
Reviews: 5,401 → 7,424 rps (+37.4%)
Stats: 9,920 → 10,474 rps (+5.6%)
Detail: 6,962 → 6,959 rps (0%, FK-bound)
Search: 4,163 → 4,159 rps (0%, FTS-bound)
Health: 11,653 → 11,648 rps (0%, no serializer)
Total cost: ~15 lines of Python, zero new tests needed (existing 74 pass), one file touched. The profile made the win obvious — the code change was trivial.
Anti-Patterns to Avoid¶
- "This looks like it could be slow" — without a profile, you are guessing.
- Profiling a Debug build — Zig debug traces can emit megabytes of stderr
and dominate the wall-clock time. Always
uv run hyper-build --releasefirst. - Using py-spy without
--nativeon macOS — py-spy can't sample Zig frames on macOS. Use in-processcProfileviaTestClientinstead; it gets Python frames without sudo and without platform-specific native support. - Measuring one endpoint, optimizing another — the hotspot for List may not be the hotspot for Search. Profile the endpoint whose throughput you want to improve.
- Optimizing without a re-profile — you must prove the fix worked. Run the profile again after the change and diff the top-15 table. If the targeted function didn't drop, the fix is wrong.
- Chasing microseconds on a millisecond call — a function that takes
2μs in a 5,000μs request is 0.04% of total time. Even a 100x speedup
there yields a 0.04% improvement. The per-call cost matters less than
calls × tottime— always sort by self-time first.
Measurement caveats¶
- Wire-speed wrk numbers from single-run comparisons should be treated with caution. Back-to-back wrk runs against the same code on an unloaded machine can vary by ±20%. cProfile per-call self-time numbers are the reliable attribution — they exclude DB roundtrip variance.
- When wire-speed claims contradict cProfile claims, trust cProfile. A 24% per-call cProfile delta is a Python structural win; a 40% wrk delta on the same change is at least partly environment.
- Matched-state baselines matter: if you measure "before" on Monday and "after" on Tuesday, the delta is noise. Measure before + after within the same session, ideally within minutes of each other.
Tools¶
| Tool | Use When | Cost |
|---|---|---|
cProfile + pstats |
First-pass hotspot discovery in Python code | Zero setup, in-process, portable |
py-spy |
Sampling profile under sustained real load | Requires sudo on macOS, --native Linux-only |
scripts/profile_hypernews_cprofile.py |
Multi-endpoint cProfile with Stability Rule built-in | Copy & modify endpoint set; per-endpoint iter counts |
scripts/profile_list_cprofile.py |
bookstore List cProfile (template for any single REST endpoint) | Copy and modify the target path |
scripts/profile_admin_cprofile.py |
HyperAdmin pages with auth flow handled | Logs in via /admin/login/ automatically |
scripts/profile_openapi_cprofile.py |
OpenAPI spec build via direct call + HTTP path | Two-scenario harness |
scripts/profile_queue_channel_cprofile.py |
task_queue enqueue/execute + channel.publish | Multi-scenario |
scripts/profile_write_cprofile.py |
REST framework write/POST path | Counterpart to profile_list_cprofile.py |
scripts/bench_hypernews_wrk.py |
Hypernews wire-speed verification | Multi-run median, jitter tracked, drop+seed integrated |
scripts/bench_bookstore_wrk.py |
Bookstore REST wire-speed verification | Same structure, different endpoint set |
scripts/_wrk_bench.py |
Shared wrk harness — import WrkBenchmark, WrkTarget, run_wrk_benchmark |
Use this when adding a new wrk bench |
scripts/bench_pool_queue_depth.py |
pg.zig pool acquire contention sampler during wrk run | Histograms waiters, recommends pool sizing |
scripts/bench_db_query_dicts.py |
Row-shape microbench: _db_query vs _db_query_dicts |
Isolates per-row dict construction |
scripts/bench_load_orm.py |
Original wrk bench, deeper concurrency (200 conns) | Single-run, no median |
scripts/bench_profile.py |
Per-endpoint header inspection (X-Query-Count) | Relies on PerformanceMiddleware being wired |
scripts/bench_where_compile.py |
Micro-benchmark for _where_compile |
Only trust if function is in cProfile top 10 |
scripts/bench_from_record_bulk.py |
Micro-benchmark for Model hydration variants | Hydration variant comparison |
scripts/bench_time_ago.py |
Micro-benchmark for time_bucket_cached | Bucket cache microbench |
Built-in @profile_handler |
Per-request X-Profile header timing breakdown | Zero external deps, in-process |
When to use which¶
- Start with
cProfile(viaprofile_list_cprofile.py). It's the fastest path from question to answer. 500 iterations complete in <1 second and surface Python-level hotspots with full call counts. - Follow with
bench_load_orm.pyfor the before/after throughput delta. This is the only number that matters for the CHANGELOG. - Use
py-spyonly when cProfile points to a native function and you need to see inside it (which on macOS you cannot, so this is Linux-only). -
Use
bench_where_compile.py-style micro-benchmarks only after cProfile has confirmed the function is in the top 10. Don't write a micro-benchmark for a function you think might be slow. -
Profile storage uses a thread-safe ring buffer with a configurable capacity