# Validation Report — 2026-04-09 Meeting DB-lock fix and manual phase4 timeout follow-up

- Date: 2026-04-09
- Project root: `/home/clawdbot/clawd/Event_management`
- Validation run session: `golden-spruce-leaf`
- Validation report source: `/home/clawdbot/clawd/Event_management/reports/pipeline-golden-spruce-leaf.md`

## Goal
Validate the remediation implemented for the Meeting Vienna DB-lock and scrape-timeout incident, then determine the next remaining failure mode if any.

## Result summary

The validation run produced a **mixed but useful** result.

### Validated as fixed/mitigated
- `meeting:scrape` passed
- `meeting:deep_dive` passed
- no recurrence of the old `database is locked` hard stage failure was observed in the validated run
- DB runtime remained in the hardened state:
  - `journal_mode = wal`
  - `busy_timeout = 30000`
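As a reference point, the hardened state above corresponds to two SQLite PRAGMAs. A minimal sketch (not the project's actual code; the database path is illustrative):

```python
import sqlite3

# Apply and verify the hardened settings the validation run confirmed.
conn = sqlite3.connect("events.db")  # illustrative path
conn.execute("PRAGMA journal_mode=WAL;")    # WAL: readers no longer block the writer
conn.execute("PRAGMA busy_timeout=30000;")  # wait up to 30s on a locked DB instead of failing fast
mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
timeout = conn.execute("PRAGMA busy_timeout;").fetchone()[0]
conn.close()
```

Together these make a transient writer collision wait rather than raise `database is locked`, which matches the absence of the old hard stage failure in this run.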

### Still failing
- `meeting:manual_phase4` still failed
- failure class:
  - `infra_timeout`
- duration:
  - `2400.1s`

## Stage outcomes from validation run

From `pipeline-golden-spruce-leaf.md`:
- `meeting:scrape` → `ok` in `2174.22s`
- `meeting:deep_dive` → `ok` in `367.39s`
- `meeting:manual_phase4` → `failed` in `2400.1s` (timeout)

## What this validates

### 1. Original scrape timeout + DB-lock bug is materially improved
This is the main good news.

Yesterday’s failing Meeting path looked like:
- scrape timeout at 2400s
- later manual phase4 crash on `database is locked`

Today’s validation run shows:
- scrape no longer times out at the old 2400s ceiling
- deep_dive completes normally
- the lock-crash pattern is no longer the observed failure mode

That means the DB hardening and scrape-timeout change did real work.

### 2. A second bottleneck is now exposed cleanly
Once the lock/crash noise was removed, the next real bottleneck became visible:
- `meeting:manual_phase4` runtime design is too heavy for the current timeout budget

## Root cause of the remaining manual phase4 failure

### Culprit
The core problem is a **runtime-budget mismatch inside `_manual_phase4_common.py`**.

The code claims:
- `MAX_EVENT_SECONDS = 60`
- `MAX_PAGE_VISITS_PER_EVENT = 8`

But in reality each page visit can block for:
- `pg.goto(..., timeout=60000)`
- `pg2.goto(..., timeout=60000)`

That means the real worst case is not 60s per event. With 8 allowed page visits, it can be up to roughly:
- 1 target page × 60s
- plus up to 7 hop pages × 60s
- = **up to ~480s per event** (8× the nominal budget) before the visit cap stops it

So the code enforces a 60-second event budget only **between navigations**, not **around** them.

### Why that matters
The timeout checks happen before each navigation call:
- `if time.time() - event_t0 > MAX_EVENT_SECONDS:`

But once a `goto(..., timeout=60000)` starts, that single call can consume the entire event budget by itself.

So the event budget is not actually a hard event budget.
It is more like a polite suggestion.
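The flaw can be demonstrated with a scaled-down model (this is a sketch, not the real `_manual_phase4_common.py`): the budget check runs only between navigations, so a single slow navigation overshoots the whole event budget unchecked.

```python
import time

def process_event(nav_times, budget_s):
    """Model of the per-event loop: check budget, then navigate."""
    t0 = time.time()
    visited = 0
    for nav_time in nav_times:
        if time.time() - t0 > budget_s:  # check happens *before* the navigation...
            break
        time.sleep(nav_time)             # ...stand-in for pg.goto(..., timeout=60000)
        visited += 1
    return visited, time.time() - t0

# Budget of 0.1s, each "navigation" blocking 0.3s: the first visit alone
# overshoots the budget 3x -- a scaled mirror of a 60s goto inside a
# 60s event budget.
visited, elapsed = process_event([0.3, 0.3], budget_s=0.1)
```

The loop does eventually stop, but only after the in-flight navigation finishes, which is exactly the "polite suggestion" behavior described above.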

## Evidence from the validation run

### 1. Manual phase4 candidates
The current validation run started with:
- `candidates = 30`

### 2. Heavy per-event runtime behavior
For the current run:
- several events took ~60–70s each
- repeated blocked/form-only cases hit page-visit caps and/or many follow-up pages

Examples seen in the live phase4 logs:
- `Favoriten in der Kardiologie`
  - `69.34s`
  - `page_visits = 8`
  - `phase4_tag = manual_blocked`
- `10. D-A-C-H – Symposium ...`
  - `66.03s`
  - `phase4_tag = manual_blocked`
- `25th Annual Conference on European Tort Law`
  - `65.74s`
  - `phase4_tag = manual_recovered`
- `REAL CORP 2026 ...`
  - `63.98s`
  - `page_visits = 8`
  - `phase4_tag = manual_form_only`

### 3. The shape of expensive cases
The worst cases are not necessarily the ones that recover useful data. They are often:
- blocked targets
- Cloudflare-ish or anti-bot targets
- targets with many contact-like fallback pages
- domains like `registration.maw.co.at` or other conference microsites where multiple fallback hops are attempted

So the script burns a lot of budget even when the outcome is just `manual_blocked` or `manual_form_only`.

## Practical interpretation

The validated situation is now:

### Fixed enough to count as progress
- DB lock contention no longer appears to be the active killer
- scrape budget problem is mitigated

### Newly exposed remaining bug
- manual phase4 still has a structural runtime bug
- its per-event caps are weaker than they look
- the stage timeout of 2400s is not realistic for 30 expensive candidates under the current navigation model: even at the observed ~60–70s per event, 30 candidates consume ~1,800–2,100s, and a single ~480s worst case pushes the run past the budget

## Recommended next remediation direction

1. Add a real per-event deadline around browser navigation calls
   - effective timeout for each `goto()` should shrink based on remaining event budget
2. Reduce `MAX_PAGE_VISITS_PER_EVENT` for Meeting Vienna or make it source-specific
3. Prioritize low-cost/high-value targets first and stop early for low-value fallback paths
4. Add checkpoint/resume for manual phase4 so timing out does not waste partial progress
5. Revisit stage timeout after runtime shaping, not just by blindly increasing it
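Recommendation 1 can be sketched as follows (names are illustrative, not from the project): derive each navigation's timeout from the remaining event budget instead of passing a fixed `timeout=60000`.

```python
import time

MAX_EVENT_SECONDS = 60.0
MIN_NAV_TIMEOUT_MS = 2000  # skip navigating when under 2s of budget remains

def nav_timeout_ms(event_t0, now=None, max_event_s=MAX_EVENT_SECONDS):
    """Remaining event budget in ms, or None if too little is left to navigate."""
    now = time.time() if now is None else now
    remaining_ms = (max_event_s - (now - event_t0)) * 1000
    return int(remaining_ms) if remaining_ms >= MIN_NAV_TIMEOUT_MS else None

# In the per-event loop (pg being a Playwright Page), this would replace the
# fixed-timeout call:
#
#     budget = nav_timeout_ms(event_t0)
#     if budget is None:
#         break                      # hard stop: event budget exhausted
#     pg.goto(url, timeout=budget)   # goto can no longer outlive the budget
```

With this shape, the 60s event budget becomes a hard ceiling: a navigation started with 10s of budget left can block for at most 10s, so the per-event cap holds **around** navigations, not just between them.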

## Related bug records

Existing mitigated bug:
- DB lock + scrape timeout issue
- `/home/clawdbot/clawd/Event_management/docs/bugs.jsonl`

The new runtime bug should be tracked separately because it has a distinct root cause:
- manual phase4 per-event timeout mismatch / stage budget exhaustion

## Bottom line
The first remediation worked well enough to expose the next real culprit.

That culprit is not DB locking anymore.
It is the mismatch between:
- nominal event budget (`60s`)
- actual browser wait behavior (`60s` per navigation, repeated multiple times)

In short: the script thinks in fox-time, but the browser waits in glacier-time.