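Multi-stage scrape pipeline

Available since v0.10.87 (SPEC-004 Phase 1). Off by default; enable it by setting enableMultiStageScrape to true in settings. The Phase 2 domain state machine will be built on this pipeline's event stream.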

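Why build this

The old path, processWebsiteScrape, calls scrapeWithTab(url) for every official-website URL it receives and opens a chrome.tabs tab for a real scrape:

About 5-10 seconds per site
A large batch (for example 220,000 sites) takes about 30 days
Wasted work on sites with dead DNS, expired SSL certificates, 404s, blanket anti-bot rejection, or genuinely no contact info

The multi-stage pipeline adds three gates before the tab fallback: HEAD probe → GET fetch → regex extraction. Any URL that can be resolved in this lightweight client-side layer never opens a tab.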

Decision tree

                         URL
                          │
              ┌───────────▼───────────┐
              │ ②  HEAD probe (3s)    │  src/utils/website-probe.ts
              └──┬────┬─────┬─────┬───┘
                 │    │     │     │
   dead/404 ◀────┘    │     │     └──▶ ok ─▶ ③
                      │     │
        antibot ◀─────┘     │
        (cf-mitigated)      │
                            │
        retry-later ◀───────┘ (5xx / 429)
                            │
              ┌─────────────▼────────────┐
              │ ③  GET fetch (5s)        │  src/utils/website-fetcher.ts
              └──┬────┬─────┬─────┬──────┘
                 │    │     │     │
   non-html ◀────┘    │     │     └──▶ ok html ─▶ ④
                      │     │
        antibot ◀─────┘     │
        (body markers)      │
                            │
        too-small/fail ◀────┘ (< 1KB / network)
                            │
              ┌─────────────▼────────────┐
              │ ④  Regex extract         │  src/utils/contact-extractor.ts
              └──┬────┬────────────┬─────┘
                 │    │            │
   ≥ 1 contact  ─┘    │            └──▶ 0 contacts + keywords ─▶ fallback (JS-rendered)
                      │
        success ─────▶│
                      │
                      └──▶ 0 contacts, no keywords ─▶ skip (genuinely none)

fallback = hand the URL off to the original scrapeWithTab path.
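The decision tree can be sketched as a single function over three injected stages. This is a minimal illustration, not the contents of website-scrape-pipeline.ts: the `Stages` interface, the `decide` name, and the `'js-render'` reason are hypothetical, and routing both too-small and fail to `'fetch-fail'` is an assumption, since the tree above does not say where those branches end.

```typescript
type ProbeKind = 'ok' | 'dead' | 'antibot' | 'retry-later';
type FetchKind = 'ok' | 'antibot' | 'too-small' | 'non-html' | 'fail';

// Hypothetical stage signatures; the real modules may look different.
interface Stages {
  probe(url: string): Promise<{ kind: ProbeKind }>;
  fetchHtml(url: string): Promise<{ kind: FetchKind; html?: string }>;
  extract(html: string): { contacts: number; hasContactKeywords: boolean };
}

type Decision =
  | { kind: 'success'; contacts: number }
  | { kind: 'skip'; reason: string }
  | { kind: 'fallback'; reason: string };

async function decide(url: string, s: Stages): Promise<Decision> {
  // Stage 2: HEAD probe. Dead and retry-later sites stop here.
  const probe = await s.probe(url);
  if (probe.kind === 'dead') return { kind: 'skip', reason: 'dead' };
  if (probe.kind === 'retry-later') return { kind: 'skip', reason: 'retry-later' };
  if (probe.kind === 'antibot') return { kind: 'fallback', reason: 'antibot-head' };

  // Stage 3: GET fetch.
  const fetched = await s.fetchHtml(url);
  if (fetched.kind === 'non-html') return { kind: 'skip', reason: 'non-html' };
  if (fetched.kind === 'antibot') return { kind: 'fallback', reason: 'antibot-body' };
  if (fetched.kind !== 'ok' || fetched.html === undefined) {
    return { kind: 'fallback', reason: 'fetch-fail' }; // assumption: too-small and fail both land here
  }

  // Stage 4: regex extraction on the static HTML.
  const { contacts, hasContactKeywords } = s.extract(fetched.html);
  if (contacts >= 1) return { kind: 'success', contacts };
  return hasContactKeywords
    ? { kind: 'fallback', reason: 'js-render' } // hypothetical reason name
    : { kind: 'skip', reason: 'no-contact' };
}
```

Injecting the stages keeps the routing logic testable without network access; only a `fallback` decision would reach scrapeWithTab.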

File layout

File	Lines	Responsibility
`src/utils/website-probe.ts`	~180	HEAD probe (with GET fallback) + response-header classification
`src/utils/website-fetcher.ts`	~180	GET body + three-layer check: Content-Type, length, keywords
`src/utils/contact-extractor.ts`	~90	Wraps `extractDataFromHtml`, applies settings automatically, scans for contact keywords
`src/utils/website-scrape-pipeline.ts`	~175	Assembles the decision tree + unified outcome
`src/utils/scraper-executor.ts`	(~50 lines changed)	Feature-flag entry point + `applyPipelineOutcome` helper

Data structures

ProbeResult (HEAD stage)

type ProbeKind = 'ok' | 'dead' | 'antibot' | 'retry-later';

interface ProbeResult {
  kind: ProbeKind;
  status: number;        // HTTP status; 0 = network-level failure
  reason: string;        // 'dns' | '404' | 'cf-mitigated:challenge' | '403+cf-ray' | '5xx' ...
  server?: string;       // Server header (logged only, never used for routing)
  cfRay?: string;
  cfMitigated?: string;
  contentType?: string;
  finalUrl?: string;     // URL after following redirects
}

FetchResult (GET stage)

type FetchKind = 'ok' | 'antibot' | 'too-small' | 'non-html' | 'fail';

interface FetchResult {
  kind: FetchKind;
  status: number;
  html?: string;
  contentType?: string;
  finalUrl?: string;
  reason: string;
  byteLength?: number;
}

PipelineOutcome (unified exit)

type PipelineOutcome =
  | { kind: 'success'; data: ContactExtractResult; stage: 'extract'; probe; fetch }
  | { kind: 'skip'; reason: 'dead'|'non-html'|'no-contact'|'retry-later'; stage; detail }
  | { kind: 'fallback'; reason: 'antibot-head'|'antibot-body'|'fetch-fail'|...; stage; detail };

Two-layer anti-bot detection (see SPEC-004 Phase 1.3)

HEAD response headers (strong signal, decided immediately):

cf-mitigated: challenge|block (Cloudflare interception)
403 / 503 + cf-ray (Cloudflare rejection)
429 (rate limit; classified as retry-later, not antibot)
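The header rules above could be expressed roughly as follows. This is a sketch under stated assumptions rather than the code in website-probe.ts: `HeadSignals` and `classifyHead` are hypothetical names, the `'network'` reason is illustrative, and treating every status not covered by the rules as `ok` is an assumption.

```typescript
type ProbeKind = 'ok' | 'dead' | 'antibot' | 'retry-later';

// Hypothetical input shape: only the fields the routing rules look at.
interface HeadSignals {
  status: number;        // 0 = network-level failure
  cfMitigated?: string;  // value of the cf-mitigated response header, if present
  cfRay?: string;        // value of the cf-ray response header, if present
}

function classifyHead(h: HeadSignals): { kind: ProbeKind; reason: string } {
  if (h.status === 0) return { kind: 'dead', reason: 'network' };

  // Strong signals first: an explicit Cloudflare mitigation header.
  if (h.cfMitigated === 'challenge' || h.cfMitigated === 'block') {
    return { kind: 'antibot', reason: `cf-mitigated:${h.cfMitigated}` };
  }
  // 403 / 503 accompanied by cf-ray counts as a Cloudflare rejection.
  if ((h.status === 403 || h.status === 503) && h.cfRay) {
    return { kind: 'antibot', reason: `${h.status}+cf-ray` };
  }
  // 429 is a rate limit: retry later, not antibot.
  if (h.status === 429) return { kind: 'retry-later', reason: '429' };
  if (h.status >= 500) return { kind: 'retry-later', reason: '5xx' };
  if (h.status === 404) return { kind: 'dead', reason: '404' };

  // Assumption: anything else proceeds to the GET stage.
  return { kind: 'ok', reason: 'ok' };
}
```

The antibot checks run before the generic 5xx check so that a 503 carrying cf-ray is not mistaken for an ordinary server error. A cf-ray header on a 200 response does not change the result.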

GET body keywords (HEAD passed, but the body is a challenge page):

const CHALLENGE_MARKERS = [
  '/cdn-cgi/challenge-platform', 'cf-browser-verification', 'cf-im-under-attack',
  'cf-challenge-running', 'just a moment', 'checking your browser',
  'enable javascript and cookies', 'window._cf_chl_opt', '__cf_chl_jschl_tk__',
  'sucuri_cloudproxy_uuid', '__incapsula__', 'awswafcaptchacdk', 'aws-waf-token',
];

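Deliberately not a routing signal: Server: cloudflare on its own is not classified as antibot. About 70% of Cloudflare sites use it purely as a CDN and fetch normally, so the header serves only as a log/stats tag.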
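A possible shape for the GET-stage classification, combining the Content-Type, length, and keyword layers in the order listed in the file table. The function name, the three-entry marker subset, and the handling of a missing Content-Type are assumptions made for illustration; the 1 KB threshold comes from the decision tree.

```typescript
type FetchKind = 'ok' | 'antibot' | 'too-small' | 'non-html' | 'fail';

const MIN_HTML_BYTES = 1024; // the "< 1KB" threshold from the decision tree

// Illustrative subset of CHALLENGE_MARKERS; entries are lowercase on purpose.
const SAMPLE_MARKERS = ['/cdn-cgi/challenge-platform', 'just a moment', 'window._cf_chl_opt'];

function classifyBody(contentType: string | null, html: string): { kind: FetchKind; reason: string } {
  // Layer 1: Content-Type. Assumption: a missing header is treated as non-HTML.
  const ct = (contentType ?? '').toLowerCase();
  if (!ct.includes('html')) {
    return { kind: 'non-html', reason: `content-type:${ct || 'missing'}` };
  }

  // Layer 2: length, measured in UTF-8 bytes rather than characters.
  const byteLength = new TextEncoder().encode(html).length;
  if (byteLength < MIN_HTML_BYTES) {
    return { kind: 'too-small', reason: `bytes:${byteLength}` };
  }

  // Layer 3: challenge markers, matched case-insensitively against the body.
  const haystack = html.toLowerCase();
  const hit = SAMPLE_MARKERS.find((marker) => haystack.includes(marker));
  if (hit) return { kind: 'antibot', reason: `challenge:${hit}` };

  return { kind: 'ok', reason: 'ok' };
}
```

The body is lowercased once before matching, which is why the marker list can stay lowercase. Note that the Server header plays no part in this check.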

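Key fetch details

redirect: 'follow' returns the final URL (avoids re-fetching along a redirect chain)
credentials: 'omit' sends no cookies; this is a pure probe
Range: bytes=0-2047 is used when HEAD returns 405 and the probe falls back to GET, to save bandwidth
Body capped at 2 MB; the stream is cancelled so a large HTML page cannot blow up the service worker heap
Timeout implemented with AbortController (fetch has no timeout option)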
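A minimal sketch of the timeout and body-cap details, using only standard Fetch and Streams APIs. The helper names `readCapped` and `fetchHtml` and the return shape are hypothetical, and the sketch leaves out the Range header and result classification.

```typescript
const MAX_BODY_BYTES = 2 * 1024 * 1024; // 2 MB cap

// Reads at most maxBytes from the response body, then cancels the stream
// so the remainder is never buffered.
async function readCapped(res: Response, maxBytes: number): Promise<string> {
  if (!res.body) return '';
  const reader = res.body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  while (total < maxBytes) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    const room = maxBytes - total;
    const piece = value.length > room ? value.subarray(0, room) : value;
    chunks.push(piece);
    total += piece.length;
  }
  await reader.cancel(); // resolves immediately if the stream already ended

  const merged = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    merged.set(chunk, offset);
    offset += chunk.length;
  }
  // A cut in the middle of a multi-byte character decodes to U+FFFD.
  return new TextDecoder().decode(merged);
}

async function fetchHtml(url: string, timeoutMs = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      redirect: 'follow',   // res.url is the final URL after redirects
      credentials: 'omit',  // no cookies
      signal: controller.signal,
    });
    const html = await readCapped(res, MAX_BODY_BYTES);
    return { status: res.status, finalUrl: res.url, html };
  } finally {
    clearTimeout(timer); // also runs when the fetch is aborted or fails
  }
}
```

Because the abort signal stays attached while the body is being read, the timeout covers the download as well as the initial response.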

Integration point: scraper-executor.ts

// Inserted in the middle of processWebsiteScrape
const settings = await settingParamsStorageItem.getValue();
if ((settings as any).enableMultiStageScrape === true) {
  const outcome = await runWebsiteScrapePipeline(websiteUrl);
  const handled = await applyPipelineOutcome(id, websiteUrl, outcome);
  if (handled) {
    resetFailCount();
    return { success: true };
  }
  // fallback → continue with the original scrapeWithTab path
}

// Original tab path, unchanged
const { html, status, error } = await scrapeWithTab(websiteUrl);
...

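Outcome → DB / log mapping

outcome	DB scrape_status	opened (page-log)	Notes
`success`	2 (scraped)	false	Pipeline hit, no tab opened
`skip/dead`	2 (marked scraped to avoid retries)	false	Genuinely dead site
`skip/no-contact`	2	false	Static HTML with no contacts and no keywords
`skip/non-html`	2	false	Returned a PDF / image / document
`skip/retry-later`	2	false	5xx/429: marked scraped for now; the Phase 2 domain state machine will add a 24h retry
`fallback/*`	unchanged	false (tagged as fallback)	Handed to scrapeWithTab, which then writes the real result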
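The mapping table can be read as a small pure function. The sketch below is illustrative only and is not the real `applyPipelineOutcome`: the `Outcome` type is trimmed to the fields the table needs, and `mapOutcome`, `Applied`, and `SCRAPED` are hypothetical names.

```typescript
// Trimmed-down outcome union: just enough to drive the table.
type Outcome =
  | { kind: 'success' }
  | { kind: 'skip'; reason: 'dead' | 'non-html' | 'no-contact' | 'retry-later' }
  | { kind: 'fallback'; reason: string };

const SCRAPED = 2; // scrape_status value meaning "scraped"

interface Applied {
  handled: boolean;            // true = do not continue to scrapeWithTab
  scrapeStatus: number | null; // null = leave the DB row untouched
  opened: false;               // the pipeline itself never opens a tab
}

function mapOutcome(outcome: Outcome): Applied {
  switch (outcome.kind) {
    case 'success':
    case 'skip':
      // Every skip reason, including retry-later, is marked scraped in Phase 1.
      return { handled: true, scrapeStatus: SCRAPED, opened: false };
    case 'fallback':
      // The tab path writes the real result afterwards.
      return { handled: false, scrapeStatus: null, opened: false };
  }
}
```

Here `handled` plays the role of the boolean that the integration code checks before returning early.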

Log observability

The error field of appendPageLog is reused to encode multi-stage status:

Prefix	Meaning
`mstage:success — success: N contacts (...)`	Multi-stage hit
`mstage:dead@probe — ...`	HEAD classified the site as dead
`mstage:antibot-head@probe — cf-mitigated:...`	Anti-bot detected from HEAD headers
`mstage:antibot-body@fetch — challenge:_cf_chl_opt`	Anti-bot detected from the GET body
`mstage:no-contact@extract — no-keywords-no-data`	Static HTML with genuinely no contacts
`mstage:fallback/antibot-head@probe`	Event marker written just before falling back to a tab
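Since the prefixes follow a regular shape (a label, an optional `@stage`, and an optional detail after the dash separator), a consumer could encode and parse them roughly as below. Both helper names are hypothetical, and the grammar is inferred from the example rows rather than taken from the source code.

```typescript
const PREFIX = 'mstage:';
const SEP = ' \u2014 '; // the dash separator between the head and the detail

function encodeMstage(label: string, stage?: string, detail?: string): string {
  const head = PREFIX + label + (stage ? '@' + stage : '');
  return detail ? head + SEP + detail : head;
}

interface MstageEvent {
  label: string;   // e.g. 'dead', 'antibot-head', 'fallback/antibot-head'
  stage?: string;  // e.g. 'probe', 'fetch', 'extract'
  detail?: string; // free-form text after the separator
}

function parseMstage(text: string): MstageEvent | null {
  if (!text.startsWith(PREFIX)) return null; // an ordinary error string
  const rest = text.slice(PREFIX.length);

  const sepAt = rest.indexOf(SEP);
  const head = sepAt === -1 ? rest : rest.slice(0, sepAt);
  const detail = sepAt === -1 ? undefined : rest.slice(sepAt + SEP.length);

  // The stage, when present, follows the last '@' in the head.
  const at = head.lastIndexOf('@');
  if (at === -1) return { label: head, detail };
  return { label: head.slice(0, at), stage: head.slice(at + 1), detail };
}
```

Returning `null` for strings without the prefix lets genuine error messages in the same field pass through unchanged.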

In Phase 2, the mstage:* event stream will be fed into the domainStats state machine.

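Feature flag

SettingParams.enableMultiStageScrape: boolean (default false).

v0.10.87: the feature flag is off by default; turn it on manually while dogfooding
A later version will switch the default to true if dogfooding goes well

How to change it (until the UI gets a toggle):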

// Chrome DevTools console (after loading dist-v2/chrome-mv3)
chrome.storage.local.get('local:settingParams', (r) => {
  const s = r['local:settingParams'];
  s.enableMultiStageScrape = true;
  chrome.storage.local.set({ 'local:settingParams': s });
});

Expected performance (Phase 1 estimates)

Site type	Share	Old path time	New path time	Gain
Dead site (DNS/404)	~15%	5-10s (tab timeout)	3s (HEAD)	3x
Anti-bot	~10%	8-15s (tab + CF wait)	3-8s (HEAD + tab)	1.5x
No contact, no keywords	~20%	5-10s (tab + extract)	4-6s (HEAD + GET + extract)	1.5x
Static-friendly site	~40%	5-10s (tab)	3-5s (HEAD + GET + extract)	2x
JS-rendered site	~15%	5-10s	5-10s + 3-6s multi-stage (net loss)	-1.5x

Overall weighted speedup ≈ 1.8x (to be calibrated after dogfooding).
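One reading that reproduces the ≈ 1.8x figure is a share-weighted mean of the per-class gains, with the "-1.5x" of JS-rendered sites interpreted as a slowdown factor of 1/1.5. That interpretation is an assumption; the table does not state how the overall number was derived.

```typescript
// [site type, share, gain factor] taken from the table above.
const rows: Array<[string, number, number]> = [
  ['dead', 0.15, 3],
  ['anti-bot', 0.10, 1.5],
  ['no contact, no keywords', 0.20, 1.5],
  ['static-friendly', 0.40, 2],
  ['js-rendered', 0.15, 1 / 1.5], // assumption: "-1.5x" means 1.5x slower
];

// Share-weighted arithmetic mean of the gain factors.
const weightedGain = rows.reduce((sum, [, share, gain]) => sum + share * gain, 0);

console.log(weightedGain.toFixed(2)); // 1.80
```

A mean of gain factors is not the same as the ratio of total batch time before and after, so the dogfood calibration may land on a different number.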

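To do (Phase 2 handoff)

Persist the domainStats data structure (domain state machine)
Feed the pipeline event stream into domainStats for automatic state transitions
Domain blacklist short-circuit: state==='dead' skips directly without running the pipeline
state==='friendly' skips HEAD and goes straight to GET (known friendly)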
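The two short-circuits could eventually look something like the sketch below. Phase 2 is not designed yet, so everything here is hypothetical: the `DomainState` type, the `'unknown'` default state, and the `planStages` helper are placeholders for illustration only.

```typescript
// 'dead' and 'friendly' come from the list above; 'unknown' is a hypothetical default.
type DomainState = 'unknown' | 'dead' | 'friendly';
type Stage = 'probe' | 'fetch' | 'extract';

// Returns the pipeline stages to run for a domain in the given state.
function planStages(state: DomainState): Stage[] {
  switch (state) {
    case 'dead':
      return []; // blacklist short-circuit: skip without running the pipeline
    case 'friendly':
      return ['fetch', 'extract']; // known friendly: go straight to GET
    case 'unknown':
      return ['probe', 'fetch', 'extract']; // full Phase 1 pipeline
  }
}
```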