# Incident Report — 2026-04-09 Meeting pipeline timeout and DB lock failures

- Date: 2026-04-09
- Project root: `/home/clawdbot/clawd/Event_management`
- Session/report under investigation: `timber-cedar-golden`
- Trigger report: `/home/clawdbot/clawd/Event_management/reports/pipeline-timber-cedar-golden.md`
- Status: investigated

## Summary

Today’s Event Management daily pipeline failed partially: the `meeting.vienna.info` path hit two separate failures:

1. `meeting:scrape` failed with `class=infra_timeout` after **2400.11s**
2. `meeting:manual_phase4` failed with `class=script_error` after **38.0s**

The successful stages were:

- `salzburg:scrape`
- `expo:scrape`
- `meeting:deep_dive`
- `salzburg:deep_dive`
- `expo:deep_dive`
- `salzburg:manual_phase4`
- `expo:manual_phase4`

This was not a full pipeline outage; it was a concentrated failure in the `meeting.vienna.info` branch.

## Impact

- Daily pipeline finished in a failed state.
- `meeting.vienna.info` scrape coverage was incomplete because the scrape phase timed out before completion.
- `meeting.vienna.info` manual phase 4 processed **0** items because the stage crashed on a DB write lock.
- Salzburg and Expo sources still completed normally.
- Operationally, this reduces confidence in the daily 10:00 pipeline for the most expensive source, exactly where resilience matters most.

## Evidence collected

### 1. Daily report evidence

From `/home/clawdbot/clawd/Event_management/reports/pipeline-timber-cedar-golden.md`:

- `meeting:scrape: failed (2400.11s, class=infra_timeout)`
- `meeting:manual_phase4: failed (38.0s, class=script_error)`

### 2. Meeting scrape was still progressing right before timeout

The `meeting-vienna-info.ndjson` stream shows the scraper was still actively finishing pass2 attempts seconds before the stage timeout. This means the timeout was not caused by a dead browser/CDP collapse; the job was still working but exceeded the hard stage budget.

### 3. Lock contention was real during the scrape stage

`meeting-vienna-info-anomalies.ndjson` for the 2026-04-09 08:00 UTC run shows:

- **69** `pass2_attempt_failed` anomalies in the run window
- of those, **36** were `database is locked`
- cumulative lock-delay from those failed attempts: **~254.5 seconds**
- cumulative failed-attempt duration overall: **~674.4 seconds**

This is large enough to materially slow the stage and contribute to the timeout.

### 4. Manual phase 4 failed on a direct DB write lock

The `meeting-vienna-info-manual-phase4-anomalies.ndjson` stream shows:

- `phase4_script_error`
- `sqlite3.OperationalError: database is locked`
- failing write statement:
  - `UPDATE events SET contact_form=? WHERE id=?`

This is a direct write-lock failure, not a parsing/data issue.

### 5. DB settings increase lock sensitivity

Current DB runtime settings on `/home/clawdbot/clawd/Event_management/data/event_management.db`:

- `journal_mode = delete`
- `busy_timeout = 5000`
- `locking_mode = normal`

That means:

- rollback-journal mode instead of WAL
- only 5 seconds of wait on DB contention
- no durable writer-side retry strategy in the meeting scripts

### 6. Code path evidence

Current connection setup in the affected scripts uses plain default SQLite connections:

- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_scraper.py:376`
  - `con = sqlite3.connect(DB)`
- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_deep_dive.py:151`
  - `con = sqlite3.connect(DB)`
- `/home/clawdbot/clawd/Event_management/scripts/_manual_phase4_common.py:339`
  - `con = sqlite3.connect(DB)`

The pipeline also hard-codes the meeting scrape timeout at:

- `/home/clawdbot/clawd/Event_management/scripts/event_manager_pipeline.py`
  - `STAGE_TIMEOUTS['meeting:scrape'] = 2400`

## Root cause

### Primary root cause

The `meeting.vienna.info` stages are **not concurrency-safe under SQLite lock contention**.
Specifically:

- they use default SQLite connections
- they do not enable WAL mode in-script
- they do not set a stronger `busy_timeout`
- they do not wrap writes in a retry/backoff strategy for transient locks

As a result:

- the scrape phase accumulated many `database is locked` failures and lost several minutes to contention
- the manual phase 4 stage failed outright on a single lock during `UPDATE events SET contact_form=? WHERE id=?`

### Contributing factor

The hard stage budget for `meeting:scrape` is now too tight relative to the current workload. Evidence:

- current hard limit: **2400s**
- the previous successful scrape baseline in the logs was already roughly **1854.944s** (~30.9 min)
- today’s lock delays alone contributed **~254.5s**
- the scraper was still making progress when the timeout killed it

So the stage timeout is no longer a safe ceiling once lock stalls and normal variance are included.

## Bug record

Primary bug recorded in:

- `/home/clawdbot/clawd/Event_management/docs/bugs.jsonl`

Remediation plan recorded in:

- `/home/clawdbot/clawd/Event_management/docs/remediation-plan-2026-04-09-meeting-pipeline-db-lock-and-timeout.md`

## Recommended remediation direction

1. Move Event Management DB access to WAL mode for mixed read/write concurrency.
2. Apply an explicit `busy_timeout` and writer retry/backoff in all meeting scripts.
3. Make write operations resilient to transient locks instead of failing the stage immediately.
4. Raise or dynamically tune the `meeting:scrape` stage timeout.
5. Add timeout/lock telemetry so future regressions are visible before they become daily-run failures.

## Implementation follow-up

Remediation has now been implemented, and the bug record has been updated to `mitigated` pending runtime validation.
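The WAL/busy-timeout/retry direction can be sketched as a shared connection helper. This is a hypothetical illustration, not the actual contents of `_event_db.py`; function names and defaults are assumptions:

```python
import random
import sqlite3
import time

def connect(db_path: str, busy_timeout_ms: int = 30000) -> sqlite3.Connection:
    """Open a connection hardened for mixed read/write concurrency.

    Sketch only: enables WAL (persisted in the database file) and a
    generous per-connection busy timeout.
    """
    con = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000)
    con.execute(f"PRAGMA busy_timeout = {busy_timeout_ms}")
    con.execute("PRAGMA journal_mode = WAL")
    return con

def execute_with_retry(con, sql, params=(), attempts=5, base_delay=0.2):
    """Retry transient 'database is locked' errors with jittered backoff.

    Non-lock errors and the final failed attempt are re-raised so real
    bugs still surface as stage failures.
    """
    for attempt in range(attempts):
        try:
            cur = con.execute(sql, params)
            con.commit()
            return cur
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

A writer would then call `execute_with_retry(con, "UPDATE events SET contact_form=? WHERE id=?", (form, event_id))` instead of issuing the statement directly, so a transient lock costs a short backoff rather than the whole stage.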
Implementation artifacts:

- `/home/clawdbot/clawd/Event_management/scripts/_event_db.py`
- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_scraper.py`
- `/home/clawdbot/clawd/Event_management/scripts/meeting_vienna_deep_dive.py`
- `/home/clawdbot/clawd/Event_management/scripts/_manual_phase4_common.py`
- `/home/clawdbot/clawd/Event_management/scripts/event_manager_pipeline.py`

Post-implementation DB validation:

- `journal_mode = wal`
- `busy_timeout = 30000`

## Notes

This incident was concentrated in the `meeting.vienna.info` source. Salzburg and Expo did not show the same failure pattern in this run, which supports the conclusion that the bug is in the Meeting Vienna path and its DB/timeout handling rather than in the whole pipeline runner.