Scan-based counting: "prefer countByQuery" / "beware the 50k-window bug family"

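Background: between v0.9.35 and v0.10.112 the same class of bug recurred four times (ISSUE-0018/0050/0051/0074). Every instance had the same shape: `selectByQuery({ limit: N, order: { by: 'id', type: 'desc' } })` followed by a counting loop. Once a user's table holds more than N rows, newer rows push older rows out of the window, and the UI shows 0, a truncated value, or a wrong count.

This has to be codified as a mandatory rule so that new code avoids the pattern by default.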

Anti-pattern (red flag): any code of this shape must be reviewed

// ❌ Red flag 1: selectByQuery + limit + order by id desc + counting loop
const rows = await selectByQuery('MapTaskData', {
  limit: 50000,
  order: { by: 'id', type: 'desc' },
});
let scraped = 0;
let withEmail = 0;
for (const r of rows) {
  if (r.scrape_status === 2) scraped++;
  if (r.emails?.length) withEmail++;
}
return { scraped, withEmail };

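Why it is broken:
- When the user's table has more than `limit` rows, only the newest `limit` rows fall inside the window.
- A newly created task pushes tens of thousands of scrape_status=0 rows at once, which fills the window.
- Older rows that were already scraped (status=2, emails>0, smaller ids) are pushed out of the window.
- The UI shows "scraped 0 / emails 0". It looks like the feature is broken, but the data is still in the DB.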
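The crowding-out effect can be reproduced on a tiny in-memory table. This is a minimal sketch, not the project's code: `Row`, `newestWindow`, and `countScraped` are hypothetical names, and `LIMIT = 5` stands in for the real 50000.

```typescript
// Hypothetical in-memory model of the table; not the project's storage layer.
interface Row {
  id: number;
  scrape_status: number; // 0 = pending, 2 = scraped
  emails: string[];
}

// Mimics `limit: N` + `order: { by: 'id', type: 'desc' }`: keep the N highest ids.
function newestWindow(rows: Row[], limit: number): Row[] {
  return [...rows].sort((a, b) => b.id - a.id).slice(0, limit);
}

function countScraped(rows: Row[]): number {
  return rows.filter((r) => r.scrape_status === 2).length;
}

// Three old rows that were already scraped (small ids) ...
const table: Row[] = [1, 2, 3].map((id) => ({
  id,
  scrape_status: 2,
  emails: [`user${id}@example.com`],
}));
// ... followed by a new task that pushes five pending rows (larger ids).
for (let id = 4; id <= 8; id++) {
  table.push({ id, scrape_status: 0, emails: [] });
}

const LIMIT = 5; // stands in for 50000
const windowed = countScraped(newestWindow(table, LIMIT)); // counts only ids 4..8
const actual = countScraped(table); // counts the whole table
console.log({ windowed, actual }); // { windowed: 0, actual: 3 }
```

The windowed count drops to 0 even though no row was deleted, which is the same symptom the KPI showed.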

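The four past occurrences:
- ISSUE-0018: hasEmailOnly silently truncated at 50k
- ISSUE-0050: chip count truncated at 50k
- ISSUE-0051: merchant list stutter (50k rows × 5s polling)
- ISSUE-0074: KPI showed 0 emails (a user with 187k rows whose newest 50k were all status=0)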

Correct patterns (green flags)

Priority 1: use `countByQuery` (cheapest, millisecond-level)

// ✅ Get the three totals with native count: no limit, instant and exact
const [total, scraped, pending] = await Promise.all([
  countByQuery('MapTaskData', {}) as Promise<number>,
  countByQuery('MapTaskData', { scrape_status: 2 }) as Promise<number>,
  countByQuery('MapTaskData', { scrape_status: 0 }) as Promise<number>,
]);

Prerequisite: every field used in where must have enableSearch: true (check base.ts). Otherwise jsstore falls back to a full-table scan, which ends up slower.

Priority 2: narrow the set with `where` on an indexed field, then scan

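// ✅ emails require a row scan to deduplicate, but first use where to restrict the scan to scraped rows
// The number of scrape_status=2 rows is usually ≤ actual output (far smaller than the whole table), so the window is not truncated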
const doneRows = await selectByQuery('MapTaskData', {
  where: { scrape_status: 2 },
  limit: SCAN_CAP,
  order: { by: 'id', type: 'desc' },
});
const emails = new Set<string>();
for (const r of doneRows) for (const e of r.emails || []) emails.add(e);

Priority 3: scan in separate windows (when several statuses must be counted)

// ✅ status=0 rows must be scanned too (the phone field is already set at the map stage), but in two windows
const [done, pending] = await Promise.all([
  selectByQuery(T, { where: { scrape_status: 2 }, limit: 50000, order: {by:'id',type:'desc'}}),
  selectByQuery(T, { where: { scrape_status: 0 }, limit: 50000, order: {by:'id',type:'desc'}}),
]);
// Each window holds 50k rows and the two do not crowd each other out, so 100k rows fit in total

Decision table

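| What you need to compute | What to use | Notes |
| --- | --- | --- |
| Total row count / row count for one status | `countByQuery(T, {where})` | Fields need `enableSearch:true` |
| Number of distinct values (emails / urls / phones) | `selectByQuery(T, {where: status field, limit})` | First use where to narrow to rows that can contain values |
| Per-row average / sum / max | `selectByQuery + reduce` | If the table exceeds limit, scan in separate windows or in pages |
| Display of the newest N rows | `selectByQuery({limit:N, order:id desc})` | The only case where id desc + limit is correct |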
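For the "scan in pages" option in the table, a minimal sketch follows. It assumes the storage layer can return a page given an offset and a page size; `PageFetcher`, `scanAllPages`, the `score` field, and the in-memory `fakeFetch` are all hypothetical and only illustrate the loop shape.

```typescript
// Hypothetical page fetcher: returns up to `limit` rows starting at offset `skip`,
// in a stable order (for example ascending id).
type PageFetcher<T> = (skip: number, limit: number) => Promise<T[]>;

// Visits every row page by page, so no single query needs a window larger than pageSize.
async function scanAllPages<T>(
  fetchPage: PageFetcher<T>,
  pageSize: number,
  visit: (row: T) => void,
): Promise<number> {
  if (pageSize <= 0) throw new Error("pageSize must be positive");
  let scanned = 0;
  for (;;) {
    const page = await fetchPage(scanned, pageSize);
    for (const row of page) visit(row);
    scanned += page.length;
    // A short (or empty) page means the end of the table was reached.
    if (page.length < pageSize) return scanned;
  }
}

// In-memory stand-in for the table, used only for this demo.
interface ScoreRow {
  id: number;
  score: number;
}
const fakeTable: ScoreRow[] = [1, 2, 3, 4, 5, 6, 7].map((id) => ({ id, score: id * 10 }));
const fakeFetch: PageFetcher<ScoreRow> = async (skip, limit) =>
  fakeTable.slice(skip, skip + limit);

async function sumScores(): Promise<{ scanned: number; sum: number }> {
  let sum = 0;
  const scanned = await scanAllPages(fakeFetch, 3, (row) => {
    sum += row.score;
  });
  return { scanned, sum };
}

sumScores().then((result) => console.log(result)); // { scanned: 7, sum: 280 }
```

Offset-based paging is only exact if rows are not inserted or deleted while the scan runs; a scan that must tolerate concurrent writes would need a different cursor, such as "id greater than the last id seen".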

Required checklist (when writing or changing derived-count code)

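[ ] Is this "show the newest N rows" (window semantics are correct) or "derive a statistic over the full data set" (window semantics are wrong)?
[ ] If it is a statistic: can countByQuery be used? Do the where fields have enableSearch:true?
[ ] If rows must be scanned: can where narrow the scan to a "meaningful" subset (status=2 / hasEmail, etc.)?
[ ] Does the comment on limit state what happens once N is exceeded and how the user would notice?
[ ] When total > limit, does the UI show a truncated hint? (for example `truncated: total > sample.length` in merchant-stats.ts)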
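The last checklist item can be sketched as follows. Only the expression `truncated: total > sample.length` comes from merchant-stats.ts as quoted above; `EmailRow`, `EmailSummary`, and `summarizeEmails` are hypothetical names, and `total` is assumed to come from a separate exact count such as countByQuery.

```typescript
// Hypothetical row and result shapes for illustration only.
interface EmailRow {
  emails?: string[];
}

interface EmailSummary {
  distinctEmails: number;
  scanned: number; // rows actually looked at
  total: number; // exact row count from a separate count query
  truncated: boolean; // true when the sample did not cover every row
}

function summarizeEmails(sample: EmailRow[], total: number): EmailSummary {
  const seen = new Set<string>();
  for (const row of sample) {
    for (const email of row.emails ?? []) seen.add(email);
  }
  return {
    distinctEmails: seen.size,
    scanned: sample.length,
    total,
    truncated: total > sample.length,
  };
}

// Two rows were scanned, but the exact count says the table holds five.
const summary = summarizeEmails(
  [{ emails: ["a@example.com"] }, { emails: ["a@example.com", "b@example.com"] }],
  5,
);
console.log(summary.truncated); // true
```

Returning `scanned` and `total` next to the derived number lets the UI show something like "based on the first N of M rows" instead of a silently wrong figure.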

Known violations (as of v0.10.112)

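| File | Line | Status |
| --- | --- | --- |
| `src/utils/data-counts.ts` | (rewritten) | ✅ Fixed in v0.10.112 |
| `src/sections/data/data-view.tsx` | 153/158 | ✅ Fixed in v0.10.112: split into two windows, status=2 + status=0 |
| `src/utils/merchant-stats.ts` | 80 | ⚠️ Same bug, fix pending (candidate for v0.10.113) |
| `src/utils/merchant-stats.ts` | 113 | 🟡 taskId path does a full-table scan + JS filter; constrained by ISSUE-0008, usually OK because a single task has ≤ 50k rows |
| `src/sections/page/task-filter-picker.tsx` | 52 | 🟢 task table is small (≤ 10k), so the 100k window is enough |
| `src/entrypoints/background/task-manager.ts` | 177 | 🟢 Same as above |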

Automation (scan:count-window to be added in v0.10.113)

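scripts/scan-count-window.py scans src/ for the dangerous pattern:
- selectByQuery + limit + order: id desc on a non-display path (suspected derived count)
- Each hit must carry a // SAFE: count-window-ok exemption comment; otherwise pre-commit blocks the commit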

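The specific reason for the exemption must be written out, for example:
- // SAFE: count-window-ok - a single task has ≤ 50k rows, and the taskId where clause uses the index
- // SAFE: count-window-ok - the user explicitly asked for "show the newest N rows" display semantics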
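The detection rule can be illustrated with a small line-based heuristic. This is a sketch of the idea in TypeScript, not the contents of scripts/scan-count-window.py: `findCountWindowHits`, the 8-line lookahead, and the regular expressions are assumptions. It also does not try to tell display paths from counting paths, so every unexempted hit needs a human decision.

```typescript
interface Hit {
  line: number; // 1-based line number of the selectByQuery call
  text: string;
}

const SAFE_MARK = "SAFE: count-window-ok";

// Flags selectByQuery calls that combine `limit` with `order by id desc`
// and have no exemption comment on the preceding line or inside the call.
function findCountWindowHits(source: string, lookahead = 8): Hit[] {
  const lines = source.split("\n");
  const hits: Hit[] = [];
  lines.forEach((text, i) => {
    if (!text.includes("selectByQuery(")) return;
    // The call line plus a few following lines approximate the call body.
    const body = lines.slice(i, i + lookahead).join("\n");
    const hasLimit = /\blimit\s*:/.test(body);
    const hasIdDesc =
      /by\s*:\s*['"]id['"]/.test(body) && /type\s*:\s*['"]desc['"]/.test(body);
    // The exemption may sit on the line above the call or within the body.
    const context = lines.slice(Math.max(0, i - 1), i + lookahead).join("\n");
    if (hasLimit && hasIdDesc && !context.includes(SAFE_MARK)) {
      hits.push({ line: i + 1, text: text.trim() });
    }
  });
  return hits;
}

const risky = [
  "const rows = await selectByQuery('MapTaskData', {",
  "  limit: 50000,",
  "  order: { by: 'id', type: 'desc' },",
  "});",
].join("\n");
const exempted = "// SAFE: count-window-ok - display of the newest N rows\n" + risky;

console.log(findCountWindowHits(risky).length); // 1
console.log(findCountWindowHits(exempted).length); // 0
```

A fixed lookahead can misattribute options when two calls sit close together; a parser-based check would be more precise at the cost of more setup.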

Lessons

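Every limit is a time bomb: if the comment does not say what is dropped once N is exceeded, it goes off for the next user who hits it
Four recurrences and it can still happen again: without automated detection and a hard rule, it will
Layers of the bug: the user sees "0 emails" and assumes the product is broken; the data is actually intact in the DB and ContactPool; only the counting code used the wrong query pattern
Fix order: first fix the KPI so users see real numbers right away; then write the rule so it does not recur; finally scan existing code for other violations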