# Incident Report — 2026-04-09 Meeting pipeline timeout and DB lock failures

- Date: 2026-04-09
- Project root: `/home/clawdbot/clawd/Event_management`
- Session/report under investigation: `timber-cedar-golden`
- Trigger report: `/home/clawdbot/clawd/Event_management/reports/pipeline-timber-cedar-golden.md`
- Status: investigated

## Summary

Today’s Event Management daily pipeline partially failed because the `meeting.vienna.info` path hit two separate failures:

1. `meeting:scrape` failed with `class=infra_timeout` after **2400.11s**
2. `meeting:manual_phase4` failed with `class=script_error` after **38.0s**

The successful stages were:
- `salzburg:scrape`
- `expo:scrape`
- `meeting:deep_dive`
- `salzburg:deep_dive`
- `expo:deep_dive`
- `salzburg:manual_phase4`
- `expo:manual_phase4`

This was not a full pipeline outage. It was a concentrated failure in the `meeting.vienna.info` branch.

## Impact

- Daily pipeline finished in failed state.
- `meeting.vienna.info` scrape coverage was incomplete because the scrape phase timed out before completion.
- `meeting.vienna.info` manual phase 4 processed **0** items because the stage crashed on DB write lock.
- Salzburg and Expo sources still completed normally.
- Operationally, this reduces confidence in the daily 10:00 pipeline for the most expensive source, exactly where resilience matters most.

## Evidence collected

### 1. Daily report evidence
From `/home/clawdbot/clawd/Event_management/reports/pipeline-timber-cedar-golden.md`:
- `meeting:scrape: failed (2400.11s, class=infra_timeout)`
- `meeting:manual_phase4: failed (38.0s, class=script_error)`

### 2. Meeting scrape was still progressing right before timeout
The `meeting-vienna-info.ndjson` stream shows the scraper was still actively finishing pass2 attempts seconds before the stage timeout. This means the timeout was not caused by a dead browser/CDP collapse. The job was still working, but exceeded the hard stage budget.

### 3. Lock contention was real during the scrape stage
`meeting-vienna-info-anomalies.ndjson` for the 2026-04-09 08:00 UTC run shows:
- **69** `pass2_attempt_failed` anomalies in the run window
- of those, **36** were `database is locked`
- cumulative lock-delay from those failed attempts: **~254.5 seconds**
- cumulative failed-attempt duration overall: **~674.4 seconds**

This is large enough to materially slow the stage and contribute to the timeout.
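The aggregate figures above can be recomputed from the anomaly stream. This is a minimal sketch; the field names (`type`, `error`, `duration_s`) are assumptions about the NDJSON schema, not confirmed from the actual files.

```python
import json

def summarize_lock_contention(ndjson_lines):
    """Tally pass2 failures and the share caused by SQLite lock contention.

    Assumes each NDJSON record carries "type", "error", and "duration_s"
    fields; adjust to the real schema before use.
    """
    events = [json.loads(line) for line in ndjson_lines if line.strip()]
    failed = [e for e in events if e.get("type") == "pass2_attempt_failed"]
    locked = [e for e in failed if "database is locked" in e.get("error", "")]
    return {
        "failed_attempts": len(failed),
        "locked_attempts": len(locked),
        "locked_delay_s": round(sum(e.get("duration_s", 0.0) for e in locked), 1),
        "failed_duration_s": round(sum(e.get("duration_s", 0.0) for e in failed), 1),
    }
```

Running this over the 2026-04-09 08:00 UTC window should reproduce the 69/36 split and the cumulative delays quoted above.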

### 4. Manual phase 4 failed on a direct DB write lock
The `meeting-vienna-info-manual-phase4-anomalies.ndjson` stream shows:
- `phase4_script_error`
- `sqlite3.OperationalError: database is locked`
- failing write statement:
  - `UPDATE events SET contact_form=? WHERE id=?`

This is a direct write-lock failure, not a parsing/data issue.

### 5. DB settings increase lock sensitivity
Current DB runtime settings on `/home/clawdbot/clawd/Event_management/data/event_management.db`:
- `journal_mode = delete`
- `busy_timeout = 5000`
- `locking_mode = normal`

That means:
- rollback-journal mode instead of WAL, so readers and the writer block each other
- only 5 seconds of grace on DB contention before `database is locked` is raised
- and, on the script side, no durable writer retry strategy to absorb transient locks
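These settings are consistent with plain default `sqlite3.connect()` calls: a file-backed database defaults to `delete` journal mode, and Python's default `timeout` of 5.0 s translates into exactly the observed 5000 ms busy timeout. A throwaway probe illustrates this (the real DB path is swapped for a temp file):

```python
import os
import sqlite3
import tempfile

# Probe a fresh file-backed DB with an all-defaults Python connection.
# This reproduces the settings observed on event_management.db:
# "delete" journal mode, and a 5000 ms busy timeout coming from
# sqlite3.connect()'s default timeout of 5.0 seconds.
path = os.path.join(tempfile.mkdtemp(), "probe.db")
con = sqlite3.connect(path)
journal_mode = con.execute("PRAGMA journal_mode").fetchone()[0]
busy_timeout = con.execute("PRAGMA busy_timeout").fetchone()[0]
con.close()
print(journal_mode, busy_timeout)  # delete 5000
```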

### 6. Code path evidence
Current connection setup in the affected scripts uses plain default SQLite connections:
- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_scraper.py:376`
  - `con = sqlite3.connect(DB)`
- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_deep_dive.py:151`
  - `con = sqlite3.connect(DB)`
- `/home/clawdbot/clawd/Event_management/scripts/_manual_phase4_common.py:339`
  - `con = sqlite3.connect(DB)`

The pipeline also hard-codes the meeting scrape timeout at:
- `/home/clawdbot/clawd/Event_management/scripts/event_manager_pipeline.py`
- `STAGE_TIMEOUTS['meeting:scrape'] = 2400`

## Root cause

### Primary root cause
The `meeting.vienna.info` stages are **not concurrency-safe under SQLite lock contention**.

Specifically:
- they use default SQLite connections
- they do not enable WAL mode in-script
- they do not set a stronger `busy_timeout`
- they do not wrap writes in a retry/backoff strategy for transient locks

As a result:
- the scrape phase accumulated many `database is locked` failures and lost several minutes to contention
- the manual phase 4 stage failed outright on a single lock during `UPDATE events SET contact_form=? WHERE id=?`

### Contributing factor
The hard stage budget for `meeting:scrape` has become too tight relative to the current workload.

Evidence:
- current hard limit: **2400s**
- previous successful scrape baseline in the logs was already roughly **1854.944s** (~30.9 min)
- today’s lock delays alone contributed **~254.5s**
- the scraper was still making progress when the timeout killed it

So the stage timeout is no longer a safe ceiling once lock stalls and normal variance are included.
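A back-of-envelope check with the figures quoted above shows how little margin remains once lock delays are stacked on the baseline:

```python
# Headroom check using the numbers from this report.
baseline = 1854.944   # previous successful scrape duration (s)
lock_delay = 254.5    # today's cumulative lock-delay (s)
budget = 2400         # hard stage timeout for meeting:scrape (s)
headroom = budget - (baseline + lock_delay)
print(round(headroom, 1))  # 290.6
```

Roughly 290 s of headroom (~12% of the budget) leaves no room for normal run-to-run variance on the most expensive source.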

## Bug record

Primary bug recorded in:
- `/home/clawdbot/clawd/Event_management/docs/bugs.jsonl`

Remediation plan recorded in:
- `/home/clawdbot/clawd/Event_management/docs/remediation-plan-2026-04-09-meeting-pipeline-db-lock-and-timeout.md`

## Recommended remediation direction

1. Move Event Management DB access to WAL mode for mixed read/write concurrency.
2. Apply explicit `busy_timeout` and writer retry/backoff in all meeting scripts.
3. Make write operations resilient to transient locks instead of failing the stage immediately.
4. Raise or dynamically tune the `meeting:scrape` stage timeout.
5. Add timeout/lock telemetry so future regressions are visible before they become daily-run failures.
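Directions 2 and 3 can be sketched as a single write wrapper. This is a minimal illustration of retry/backoff for transient locks; the function name and parameters are hypothetical, not the remediation actually shipped.

```python
import random
import sqlite3
import time

def execute_with_retry(con, sql, params=(), attempts=5, base_delay=0.2):
    """Retry a write on transient 'database is locked' errors.

    Uses exponential backoff with jitter so contending writers do not
    retry in lockstep. Sketch only; names and defaults are assumptions.
    """
    for attempt in range(attempts):
        try:
            with con:  # transaction: commits on success, rolls back on error
                return con.execute(sql, params)
        except sqlite3.OperationalError as exc:
            # Re-raise non-lock errors, and the final lock failure.
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

Wrapping the failing `UPDATE events SET contact_form=? WHERE id=?` statement in such a helper would have turned the phase 4 crash into a short stall.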

## Implementation follow-up

Remediation has now been implemented and the bug record has been updated to `mitigated` pending runtime validation.

Implementation artifacts:
- `/home/clawdbot/clawd/Event_management/scripts/_event_db.py`
- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_scraper.py`
- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_deep_dive.py`
- `/home/clawdbot/clawd/Event_management/scripts/_manual_phase4_common.py`
- `/home/clawdbot/clawd/Event_management/scripts/event_manager_pipeline.py`

Post-implementation DB validation:
- `journal_mode = wal`
- `busy_timeout = 30000`

## Notes

This incident was concentrated in the `meeting.vienna.info` source. Salzburg and Expo did not show the same failure pattern in this run, which supports the conclusion that the bug is in the Meeting Vienna path and its DB/timeout handling rather than the whole pipeline runner.