hefftools.dev
Infrastructure-focused data utilities

CT Log Paging: Short Reads, Retries, Backoff, and Contiguous Indices

CT logs serve entries by index range (get-entries). The API looks simple. The failure modes are not.

Start by anchoring on tree size

Fetching by date usually starts by calling get-sth to obtain the current tree_size. Your ingestion window becomes an index range.

GET /ct/v1/get-sth  →  {"tree_size": 123456789, ...}

A common strategy is “last N entries”: compute end = tree_size - 1 and start = max(0, end - cap + 1).
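That arithmetic can be sketched as a small helper. The function name is illustrative, not from any library:

```python
# Sketch: derive a "last N entries" index window from the log's current
# tree_size (as reported by get-sth). Entries are 0-indexed.

def last_n_window(tree_size: int, cap: int) -> tuple[int, int]:
    """Return (start, end) inclusive indices for the newest `cap` entries."""
    end = tree_size - 1              # newest entry sits at tree_size - 1
    start = max(0, end - cap + 1)    # clamp at 0 for logs smaller than cap
    return start, end

print(last_n_window(123456789, 1000))  # → (123455789, 123456788)
```

The `max(0, ...)` clamp matters for young logs: a log with 5 entries and a cap of 1000 yields the full range (0, 4) instead of a negative start.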

Retries: 429 and 5xx are normal

CT operators rate limit. Networks fail. Servers throw 5xx. If you treat these as rare, your pipeline will break.

  • Retry on 429 and 5xx
  • Retry on transport/timeout exceptions
  • Use exponential backoff
Best practice: record retries and the last URL so you can debug operator-specific behavior without guessing.
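A minimal retry loop with exponential backoff might look like the following. The `RETRIABLE` set, function names, and error handling are assumptions to adapt to your HTTP client; the injectable `sleep` exists so the loop is testable:

```python
import time

RETRIABLE = {429, 500, 502, 503, 504}  # illustrative; tune per operator

def fetch_with_retries(do_request, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """do_request() returns (status, body); raises OSError on transport failure."""
    last = None
    for attempt in range(max_attempts):
        try:
            status, body = do_request()
        except OSError as exc:                  # transport/timeout failures
            last = ("exception", repr(exc))
        else:
            if status == 200:
                return body
            if status not in RETRIABLE:
                raise RuntimeError(f"non-retriable status {status}")
            last = ("status", status)           # record what we last saw
        if attempt + 1 < max_attempts:
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last}")
```

Recording `last` (and, in a real pipeline, the URL) is what makes operator-specific 429 behavior debuggable after the fact.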

Short reads: the API can return fewer entries than requested

You request start..end and expect (end-start+1) entries. Sometimes you get fewer. This is a short read.

  • Accept the page (don’t discard it)
  • Record it as a short read
  • Advance by the count actually returned
The hard-failure case is zero entries: if the server returns an empty page for a non-empty range, you can't make progress safely.
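The three rules above can be sketched as a drain loop. `fetch_page` and `drain_range` are hypothetical names; the key invariant is advancing by the count actually returned:

```python
def drain_range(fetch_page, start, end):
    """Fetch the inclusive index range start..end, tolerating short reads.

    fetch_page(lo, hi) returns a list of entries for the inclusive range;
    the server may return fewer than requested.
    """
    entries, short_reads, cur = [], 0, start
    while cur <= end:
        page = fetch_page(cur, end)
        if not page:
            # Empty page for a non-empty range: no safe forward progress.
            raise RuntimeError(f"empty page for {cur}..{end}")
        if len(page) < end - cur + 1:
            short_reads += 1                # accept the page, but record it
        entries.extend(page)
        cur += len(page)                    # advance by the count returned
    return entries, short_reads
```

Note that advancing by the *requested* count instead of `len(page)` would silently skip entries on every short read.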

Contiguous indices: assign indices explicitly

The CT API returns an array of entries without explicit indices. If you want deterministic replay, you must assign indices as idx = cur + i while writing.

This also protects you from oddities like a server returning more than requested — you bound by end_index.
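A sketch of that assignment, with the bound applied while writing (the function name is illustrative):

```python
def index_entries(page, cur, end_index):
    """Pair each entry with its explicit index, bounded by end_index.

    If a server returns more entries than requested, anything whose
    computed index would exceed end_index is dropped.
    """
    out = []
    for i, entry in enumerate(page):
        idx = cur + i
        if idx > end_index:        # bound: ignore surplus entries
            break
        out.append((idx, entry))
    return out
```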

Rate limiting: simple RPS pacing beats “hope”

If you don’t pace requests, you’ll self-induce 429s. A simple sleep(1/rps) per request works well. If rps=0, treat it as “no pacing.”
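One way to sketch that pacing is a small class with an injectable clock (the `Pacer` name and structure are assumptions, not from any library):

```python
import time

class Pacer:
    """Sleep between calls so requests average `rps` per second.

    rps == 0 disables pacing entirely.
    """
    def __init__(self, rps, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rps if rps > 0 else 0.0
        self.clock, self.sleep = clock, sleep
        self.next_at = clock()

    def wait(self):
        if self.interval == 0.0:
            return                           # rps=0: no pacing
        now = self.clock()
        if now < self.next_at:
            self.sleep(self.next_at - now)   # hold until the next slot
        self.next_at = max(now, self.next_at) + self.interval
```

Using `max(now, self.next_at)` lets the pacer reset after idle periods instead of "bursting" to catch up on missed slots.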

If you don’t want to maintain this, that’s the point

ct-cert-feed exists to turn these operator-specific ingestion headaches into a stable bulk artifact: records.jsonl.gz + stats.json, replayable by date.