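Scrape pipeline decision table / mstage tag quick reference
When to consult: changing pipeline routing, changing how the UI displays mstage, adding a new mstage type, or fixing chip colors. Check this page first. Source: implemented in v0.10.87 (SPEC-004 Phase 1); codified as a rule in v0.10.92, after dogfooding hit the "UI left un-updated" pitfall twice.
1. Full decision tree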
URL
↓
[① Blacklist] (filtered upstream by batch-controller, not part of the pipeline)
↓
[② HEAD probe] website-probe.ts, 3s timeout
├─ dead → skip → DB scrape_status=2 → page-log error='mstage:dead@probe — ...'
├─ retry-later (5xx/429) → skip → DB scrape_status=2 → 'mstage:retry-later@probe — ...'
├─ antibot (cf-mitig/403+cf-ray/403/451) → fallback → 'mstage:fallback/antibot-head@probe'
└─ ok → proceed to ③
↓
[③ GET fetch] website-fetcher.ts, 5s timeout, 2MB cap
├─ non-html → skip → 'mstage:non-html@fetch — ...'
├─ antibot (body marker) → fallback → 'mstage:fallback/antibot-body@fetch'
├─ too-small (<1KB) → fallback → 'mstage:fallback/fetch-too-small@fetch'
├─ fail (network) → fallback → 'mstage:fallback/fetch-fail@fetch'
└─ ok → proceed to ④
↓
[④ Regex parsing] contact-extractor.ts
├─ contactCount ≥ 1 → success → DB writes emails/phones/socials + 'mstage:success — success: N contacts (emails=X phones=Y)'
├─ 0 + hasContactKeyword/hasSocialKeyword → fallback → 'mstage:fallback/contact-keyword-but-empty@extract'
└─ 0 + no keywords → skip → 'mstage:no-contact@extract — no-keywords-no-data'
↓
[⑤ tab fallback] legacy path scrapeWithTab, which opens a real chrome.tabs tab
(fallback outcomes land here; page-log then gets a second entry with opened=true for the real scrape)
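The routing at stage ④ can be sketched as a pure function. This is a minimal illustration, not the project's code: the `ExtractResult` shape, the `PipelineOutcome` union, and the name `decideAfterExtract` are assumptions made for the example. Only the branch conditions and the reason/stage strings are taken from the tree above.

```typescript
// Hypothetical types: the project's real PipelineOutcome may differ.
type Stage = "probe" | "fetch" | "extract";

type PipelineOutcome =
  | { kind: "success"; contactCount: number }
  | { kind: "skip"; reason: string; stage: Stage; detail: string }
  | { kind: "fallback"; reason: string; stage: Stage };

// Assumed summary of what the regex extractor found on the page.
interface ExtractResult {
  contactCount: number;
  hasContactKeyword: boolean;
  hasSocialKeyword: boolean;
}

function decideAfterExtract(r: ExtractResult): PipelineOutcome {
  // At least one contact found: done, no tab needed.
  if (r.contactCount >= 1) {
    return { kind: "success", contactCount: r.contactCount };
  }
  // Nothing extracted but the page mentions contacts: let the tab path try.
  if (r.hasContactKeyword || r.hasSocialKeyword) {
    return { kind: "fallback", reason: "contact-keyword-but-empty", stage: "extract" };
  }
  // Nothing extracted and no hint that contacts exist: skip.
  return { kind: "skip", reason: "no-contact", stage: "extract", detail: "no-keywords-no-data" };
}
```

Keeping the decision free of I/O makes each branch of the tree easy to unit-test in isolation.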
2. mstage tag → UI rendering reference table
Single source of truth: classifyMstage(error) in src/utils/mstage-classify.ts.
All UI rendering must import this function. Code already fixed to do so:
- src/sections/page/log-view.tsx (log page)
- src/sections/data/data-view.tsx (HTTP column of the merchant list)
| Prefix / tag | kind | sx.color | Chip color prop | tooltipPrefix | shortLabel | DB scrape_status | opened |
|---|---|---|---|---|---|---|---|
| `mstage:success — ...` | success | success.main | success 🟢 | 多阶段命中(未开 tab) (multi-stage hit, no tab opened) | ✓ 命中 (✓ hit) | 2 | false |
| `mstage:dead@probe — ...` | skip | text.secondary | default ⚪ | 跳过抓取 (scrape skipped) | 跳过 (skipped) | 2 | false |
| `mstage:no-contact@extract — ...` | skip | text.secondary | default ⚪ | 跳过抓取 (scrape skipped) | 跳过 (skipped) | 2 | false |
| `mstage:non-html@fetch — ...` | skip | text.secondary | default ⚪ | 跳过抓取 (scrape skipped) | 跳过 (skipped) | 2 | false |
| `mstage:domain-dead@domain-state` ⭐ Phase 2 | skip | text.secondary | default ⚪ | 跳过抓取 (scrape skipped) | 跳过 (skipped) | 2 | false |
| `mstage:domain-cold@domain-state` ⭐ Phase 2 | skip | text.secondary | default ⚪ | 跳过抓取 (scrape skipped) | 跳过 (skipped) | 2 | false |
| `mstage:retry-later@probe — ...` | retry-later | warning.dark | warning 🟠 | 稍后重试(5xx/429) (retry later, 5xx/429) | 重试 (retry) | 2 (this phase) | false |
| `mstage:antibot-head@probe — ...` | antibot | warning.main | warning 🟠 | 反爬拦截 / 落 tab (anti-bot block / falls to tab) | 反爬 (anti-bot) | unchanged | false (preliminary log) |
| `mstage:antibot-body@fetch — ...` | antibot | warning.main | warning 🟠 | 反爬拦截 / 落 tab (anti-bot block / falls to tab) | 反爬 (anti-bot) | unchanged | false |
| `mstage:fallback/antibot-*@*` | antibot | warning.main | warning 🟠 | 反爬拦截 / 落 tab (anti-bot block / falls to tab) | 反爬 (anti-bot) | unchanged | false |
| `mstage:fallback/domain-antibot-hard@domain-state` ⭐ Phase 2 | antibot | warning.main | warning 🟠 | 反爬拦截 / 落 tab (anti-bot block / falls to tab) | 反爬 (anti-bot) | unchanged | false |
| `mstage:fallback/fetch-fail@fetch` | fallback | info.main | info 🔵 | 降级到 tab 抓取 (downgraded to tab scrape) | 落tab (to tab) | unchanged | false |
| `mstage:fallback/fetch-too-small@fetch` | fallback | info.main | info 🔵 | 降级到 tab 抓取 (downgraded to tab scrape) | 落tab (to tab) | unchanged | false |
| `mstage:fallback/contact-keyword-but-empty@extract` | fallback | info.main | info 🔵 | 降级到 tab 抓取 (downgraded to tab scrape) | 落tab (to tab) | unchanged | false |
| `mstage:*` anything else (future extensions) | mstage-other | text.secondary | default ⚪ | 多阶段事件 (multi-stage event) | mstage | varies | varies |
| Does not start with `mstage:` (a real error, e.g. ERR_CERT_AUTHORITY_INVALID / ERR_NAME_NOT_RESOLVED) | real-error | error.main | error 🔴 | 抓取失败 (scrape failed) | error text truncated to 8 characters | 3 (failed) | true |
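Key constraints:
- ❌ Never render the "抓取失败" ("scrape failed") copy for an mstage:* tag. mstage:success is a success event!
- ✅ Build the tooltip from classifyMstage(error).tooltipPrefix; do not hard-code it.
- ✅ Set the chip color prop from classifyMstage(error).chipColor; do not hard-code color="error".
- ✅ Show classifyMstage(error).shortLabel on the chip; do not truncate with error.slice(0,12) yourself.
- ✅ When adding a new mstage type, first add a case in mstage-classify.ts, then change scraper-executor to write the page-log.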
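To make the table concrete, here is a rough sketch of a classifier that reproduces its rows. It is not the contents of src/utils/mstage-classify.ts: the name `classifyMstageSketch`, the `MstageClass` interface, the prefix-matching strategy, and the 8-character `slice` for real errors are assumptions inferred from the table columns. The UI strings are copied from the table as-is.

```typescript
// Sketch only: the real classifyMstage may be structured differently.
type MstageKind =
  | "success" | "skip" | "retry-later" | "antibot"
  | "fallback" | "mstage-other" | "real-error";
type ChipColor = "success" | "default" | "warning" | "info" | "error";

interface MstageClass {
  kind: MstageKind;
  color: string; // theme token used for sx.color
  chipColor: ChipColor;
  tooltipPrefix: string;
  shortLabel: string;
}

const SKIP_REASONS = ["dead", "no-contact", "non-html", "domain-dead", "domain-cold"];

function classifyMstageSketch(error: string): MstageClass {
  if (!error.startsWith("mstage:")) {
    // Real error (e.g. ERR_NAME_NOT_RESOLVED): the only case shown as a failure.
    return { kind: "real-error", color: "error.main", chipColor: "error",
             tooltipPrefix: "抓取失败", shortLabel: error.slice(0, 8) };
  }
  const label = error.slice("mstage:".length);
  if (label.startsWith("success")) {
    return { kind: "success", color: "success.main", chipColor: "success",
             tooltipPrefix: "多阶段命中(未开 tab)", shortLabel: "✓ 命中" };
  }
  if (label.startsWith("retry-later")) {
    return { kind: "retry-later", color: "warning.dark", chipColor: "warning",
             tooltipPrefix: "稍后重试(5xx/429)", shortLabel: "重试" };
  }
  // Test antibot before the generic fallback/ prefix: fallback/antibot-* is antibot.
  if (/^(fallback\/)?(domain-)?antibot-/.test(label)) {
    return { kind: "antibot", color: "warning.main", chipColor: "warning",
             tooltipPrefix: "反爬拦截 / 落 tab", shortLabel: "反爬" };
  }
  if (label.startsWith("fallback/")) {
    return { kind: "fallback", color: "info.main", chipColor: "info",
             tooltipPrefix: "降级到 tab 抓取", shortLabel: "落tab" };
  }
  if (SKIP_REASONS.some((r) => label.startsWith(`${r}@`))) {
    return { kind: "skip", color: "text.secondary", chipColor: "default",
             tooltipPrefix: "跳过抓取", shortLabel: "跳过" };
  }
  // Unknown mstage:* tags stay neutral rather than looking like failures.
  return { kind: "mstage-other", color: "text.secondary", chipColor: "default",
           tooltipPrefix: "多阶段事件", shortLabel: "mstage" };
}
```

A view would then read `chipColor`, `tooltipPrefix`, and `shortLabel` off the returned object instead of inspecting the raw `error` string, which is what the constraints above require.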
3. PipelineOutcome → page-log encoding rules
Single source of truth: applyPipelineOutcome in src/utils/scraper-executor.ts.
// outcome.kind === 'success'
appendPageLog({
type: 'website', url, opened: false,
status: probe?.status ?? fetch?.status,
error: `mstage:success — ${summarizeOutcome(outcome)}`, // the key line
emails: data.emails.length,
phone: ..., socials: socialList,
});
// outcome.kind === 'skip'
appendPageLog({
type: 'website', url, opened: false,
status: ...,
error: `mstage:${outcome.reason}@${outcome.stage} — ${outcome.detail}`,
emails: 0, phone: '', socials: [],
});
// outcome.kind === 'fallback' ← no DB write; only a fallback-tagged page-log entry, then continue down the tab path
appendPageLog({
type: 'website', url, opened: false,
status: ...,
error: `mstage:fallback/${outcome.reason}@${outcome.stage}`,
emails: 0, phone: '', socials: [],
});
// ... immediately followed by a page-log entry for the real tab scrape (opened: true)
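The three `error` formats above can be collected into one encoder, which keeps the string layout in a single place. This is a hedged sketch: `encodeOutcomeError` and the `summary` field are hypothetical stand-ins (the snippets above call `summarizeOutcome(outcome)` instead), and the outcome type is assumed rather than copied from the project.

```typescript
// Assumed outcome shape; `summary` stands in for summarizeOutcome(outcome).
type PipelineOutcome =
  | { kind: "success"; summary: string }
  | { kind: "skip"; reason: string; stage: string; detail: string }
  | { kind: "fallback"; reason: string; stage: string };

function encodeOutcomeError(o: PipelineOutcome): string {
  switch (o.kind) {
    case "success":
      // A success still travels in the `error` field, hence the mstage: prefix.
      return `mstage:success — ${o.summary}`;
    case "skip":
      return `mstage:${o.reason}@${o.stage} — ${o.detail}`;
    case "fallback":
      // No detail suffix: the tab scrape that follows writes its own entry.
      return `mstage:fallback/${o.reason}@${o.stage}`;
  }
}
```

For example, a skip with reason `dead` at stage `probe` and detail `timeout` encodes to `mstage:dead@probe — timeout`, which matches the first skip branch of the decision tree.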
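4. Procedure for adding a new mstage type
Steps (skipping any one of them causes a bug):
- scraper-executor.applyPipelineOutcome: decide how the new outcome's error field is encoded (the prefix must be mstage:).
- mstage-classify.ts classifyMstage: add a new if (error.startsWith('mstage:xxx')) branch and give it color / chipColor / tooltipPrefix / shortLabel.
- This table (rules/scrape-pipeline-decision-table.md): add a row in section 2.
- wiki/multi-stage-scrape-pipeline.md: add the new node to the decision tree.
- docs/issues/XXXX-xxx.md (if the type was introduced by a bug fix): archive it.
- Test: confirm on dogfood data that the UI displays as expected.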
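5. Common points of confusion
| Misconception | Reality |
|---|---|
| "The mstage:success chip should be success (green). Why does hover show '抓取失败' (scrape failed)?" | The old UI treated every e.error as a real error, with no classification. Fixed in v0.10.90 / v0.10.92 |
| "matsudental.com opens fine, so why does it show as failed?" | It is actually mstage:success; the failure display was a visual artifact of the un-updated UI |
| "Is fallback a failure?" | No. fallback means "the lightweight client-side layer could not handle it, so it falls to a tab". A real success/failure only exists once the tab scrape finishes |
| "Does DB scrape_status=2 mean success?" | Not necessarily. 2 = "scraped" (the pipeline has processed it); even skip/dead is marked 2 so the row is not picked again |
| "Why does the DB column in the table say 'unchanged' (不动) for mstage:antibot?" | antibot goes to fallback, and the tab path that follows decides scrape_status |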
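The scrape_status rules behind the last two rows can be stated as a small mapping. This is an illustrative sketch based on the DB scrape_status column in section 2; `scrapeStatusFor` and `SettledKind` are hypothetical names, and `mstage-other` is left out because the table lists its status as varying.

```typescript
// Kinds whose DB effect the section 2 table states explicitly.
type SettledKind =
  | "success" | "skip" | "retry-later"
  | "antibot" | "fallback" | "real-error";

// Returns the scrape_status to write, or null to leave the row untouched.
function scrapeStatusFor(kind: SettledKind): 2 | 3 | null {
  switch (kind) {
    case "success":
    case "skip":
    case "retry-later":
      // 2 means "processed by the pipeline", not "succeeded".
      return 2;
    case "antibot":
    case "fallback":
      // The tab path that follows decides the final status.
      return null;
    case "real-error":
      return 3;
  }
}
```

Returning `null` rather than a number makes the "no DB write" case explicit at the call site.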
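6. System-Log event list (since v0.10.96)
Apart from page-log (reserved for "scraped pages"), all pipeline-internal events go through src/utils/system-log.ts.
They are visible in the "系统事件" (System events) tab of the log page, and the "导出全部" (Export all) button downloads a JSON file for developers to analyze.
6.1 Full event list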
| category | event | level | detail fields | Source |
|---|---|---|---|---|
| pipeline | pool-hit | info | url / source / emails / phones | ContactPool cache hit |
| pipeline | domain-advice | info | url / decision / reason | domain-state short-circuit (only non-continue decisions are logged) |
| domain-state | transition | info | domain / from / to / fetchTotal / fetchOk | State transition |
| domain-state | ttl-reset | info | domain / previousState | TTL expired; automatically reset to unknown |
| domain-state | reset-all | warn | (none) | User manually reset all domain data |
| domain-state | reset-failed | error | error | jsstore.clear threw during reset |
| contact-pool | write | info | urlHash / domain / isNew / emails / phones / method | Scraped contact written to the local pool |
| contact-pool | write-failed | error | urlHash / error | jsstore upsert threw |
| contact-pool | hit | info | urlHash / source / emails / phones | Lookup hit |
| contact-pool | pull-cloud | info | count | Cloud records pulled and written locally |
| contact-pool | pull-cloud-failed | error | error | Cloud pull failed |
| contribution | upload-new / upload-verify / query-hit / query-miss / reset-local | info | localId / amount / ref | Local ledger entry |
| contribution | record-failed | error | action / error | Ledger write failed |
| cloud-sync | upload-start | info | task / count | Started uploading a batch |
| cloud-sync | upload-ok | info | task / accepted / rejected / contributionEarned | Upload succeeded |
| cloud-sync | upload-error | error | task / code / message / count | Upload failed |
| cloud-sync | pull-domain-state | info | count | Pulled cloud domain state |
| cloud-sync | pull-domain-state-failed | error | error | Pull failed |
| cloud-sync | auth-failed | warn | task | ensureAuth failed |
| cloud-sync | skip-backoff | debug | task | Skipped this round during exponential backoff |
| cloud-sync | startup-begin / startup-health-ok / startup-health-fail / startup-auth-fail / startup-done | info/warn | per event | SW startup sync flow |
6.2 Procedure for adding a new event type
- Choose a category (pipeline / domain-state / contact-pool / cloud-sync / contribution / general).
- Choose a level (debug / info / warn / error).
- Call appendSysLog(category, event, detail, level).
- Add a row to the table in §6.1 (documenting the detail fields).
- For a new category, add CATEGORY_LABELS / CATEGORY_COLORS entries in sys-log-list.tsx.
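As a rough illustration of the call in the third step, the sketch below models the logger with an in-memory array. Only the argument order (category, event, detail, level) and the category and level names come from this page; `appendSysLogSketch`, the `SysLogEntry` fields, the `info` default, and the sample detail values are assumptions, and the real appendSysLog in src/utils/system-log.ts presumably persists entries instead.

```typescript
type SysLogCategory =
  | "pipeline" | "domain-state" | "contact-pool"
  | "cloud-sync" | "contribution" | "general";
type SysLogLevel = "debug" | "info" | "warn" | "error";

// Assumed entry shape, for illustration only.
interface SysLogEntry {
  ts: number;
  category: SysLogCategory;
  event: string;
  level: SysLogLevel;
  detail: Record<string, unknown>;
}

// In-memory stand-in for whatever storage the real logger uses.
const sysLogBuffer: SysLogEntry[] = [];

function appendSysLogSketch(
  category: SysLogCategory,
  event: string,
  detail: Record<string, unknown> = {},
  level: SysLogLevel = "info", // assumed default
): SysLogEntry {
  const entry: SysLogEntry = { ts: Date.now(), category, event, level, detail };
  sysLogBuffer.push(entry);
  return entry;
}

// Two events from §6.1, with made-up detail values.
appendSysLogSketch("pipeline", "pool-hit", {
  url: "https://example.com", source: "local", emails: 1, phones: 0,
});
appendSysLogSketch("cloud-sync", "skip-backoff", { task: "upload" }, "debug");
```

Typing `category` and `level` as unions means a misspelled category fails at compile time, which is a cheap guard for the first two steps of the procedure.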
6.3 Export bundle structure
The JSON downloaded by handleExport contains:
{
"exportedAt": 1717000000000,
"exportedAtIso": "2026-05-28T...",
"exportedFromVersion": "0.10.96",
"summary": {
"sysLog": { "total": N, "byCategory": {...}, "byLevel": {...} },
"domain": { "total": N, "byState": {...} },
"contactPool": { "total": N, "bySource": {...}, "pendingUpload": N },
"ledger": { "total": N, "pendingSync": N, "localBalance": N },
"cloudDomain": { "total": N },
"pageLogCount": N
},
"sysLogs": [SysLogEntry, ...],
"pageLogs": [PageLogEntry, ...],
"domainStats": [DomainStat, ...],
"settings": { ... }
}
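When analyzing an exported bundle, one quick sanity check is to recompute `summary.sysLog` from the `sysLogs` array and compare. The sketch below assumes each log entry exposes `category` and `level` string fields; `summarizeSysLogs` and both interfaces are hypothetical names, since the real SysLogEntry type is not shown on this page.

```typescript
// Minimal view of a log entry: only the fields this check needs (assumed).
interface SysLogLike {
  category: string;
  level: string;
}

interface SysLogSummary {
  total: number;
  byCategory: Record<string, number>;
  byLevel: Record<string, number>;
}

function summarizeSysLogs(logs: SysLogLike[]): SysLogSummary {
  const byCategory: Record<string, number> = {};
  const byLevel: Record<string, number> = {};
  for (const log of logs) {
    // Count each entry once per dimension.
    byCategory[log.category] = (byCategory[log.category] ?? 0) + 1;
    byLevel[log.level] = (byLevel[log.level] ?? 0) + 1;
  }
  return { total: logs.length, byCategory, byLevel };
}
```

If the recomputed counts differ from the bundle's `summary.sysLog`, the export was likely truncated or edited after the fact.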
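Related
- 多阶段抓取pipeline (multi-stage scrape pipeline): architecture deep dive
- UI改动前置自查 (pre-change UI self-check): checklist to run before changing the UI
- [[0067-email-placeholder-phone-probe-403-toggle|0067-email占位符-phone从未写库-probe-403-UI-toggle]]: Phase 1 dogfood #1
- [[0068-dogfood-v0.10.89-4-ux-fixes|0068-dogfood-v0.10.89-清空-mstage显示-phone去重-创建loading]]: Phase 1 dogfood #2
- SPEC-004-网站采集多阶段优化-云端协同 (SPEC-004: multi-stage website scraping optimization with cloud collaboration): full spec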