URLscan.io FETCHER

Name: URLscan.io Fetcher Data Collector | Firecrawl Prometheus
Creator: troycarboni
Published: 2026-06-11T23:23:03.637Z
License: https://opensource.org/licenses/MIT

v1 · Published

Fetch the 100 most recent public scans from the urlscan.io public API.

Output & API

Author's sample data

urls
count	100
source	urlscan.io public scan feed
fetchedAt	2026-06-11T23:21:09.752Z
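
The sample above suggests a small, fixed output shape. A minimal sketch of that shape as TypeScript types, with a runtime check, might look like the following. The names `CollectedUrl`, `CollectorOutput`, and `isCollectorOutput` are hypothetical and do not appear in the listing; the per-URL fields (`url`, `domain`, `date`) are taken from the script shown on this page.

```typescript
// Hypothetical type names; field names mirror the collector's JSON output.
interface CollectedUrl {
  url: string;
  domain: string;
  date: string; // scan time as reported by the feed, or "" when missing
}

interface CollectorOutput {
  source: string;
  fetchedAt: string; // ISO 8601 timestamp
  count: number;
  urls: CollectedUrl[];
}

// Runtime guard: checks field types and that `count` matches `urls.length`.
function isCollectorOutput(value: unknown): value is CollectorOutput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const fetchedAt = v["fetchedAt"];
  const urls = v["urls"];
  if (typeof v["source"] !== "string") return false;
  if (typeof fetchedAt !== "string" || Number.isNaN(Date.parse(fetchedAt))) return false;
  if (!Array.isArray(urls) || v["count"] !== urls.length) return false;
  return urls.every((u: unknown) => {
    if (typeof u !== "object" || u === null) return false;
    const e = u as Record<string, unknown>;
    return (
      typeof e["url"] === "string" &&
      e["url"] !== "" &&
      typeof e["domain"] === "string" &&
      typeof e["date"] === "string"
    );
  });
}
```

A consumer could call `isCollectorOutput(JSON.parse(text))` before trusting a downloaded result; this is only one way to validate the payload, not something the collector itself does.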

1 subscriber

troycarboni (@troycarboni)

0 runs in 14d · published 6w ago

Versions

Every build and self-heal appends a version. Pin one to lock runs to it.

managed by author

v1 · built · approved · current · 6w ago

How this script collects data

import Firecrawl from "@mendable/firecrawl-js";
import * as cheerio from "cheerio";

const apiKey = process.env.FIRECRAWL_API_KEY;
if (!apiKey) {
  console.error("FIRECRAWL_API_KEY is not set");
  process.exit(1);
}
const firecrawl = new Firecrawl({ apiKey });

const SEARCH_URL = "https://urlscan.io/api/v1/search/?q=*&size=100";

interface UrlscanResult {
  task?: { url?: string; domain?: string; time?: string };
  page?: { url?: string; domain?: string };
}

function parseJsonBody(rawHtml: string): { results?: UrlscanResult[] } {
  // The endpoint returns raw JSON; Firecrawl may wrap it in an HTML shell.
  const direct = rawHtml.trim();
  if (direct.startsWith("{")) {
    return JSON.parse(direct);
  }
  const $ = cheerio.load(rawHtml);
  const text = $("pre").text().trim() || $("body").text().trim();
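  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("no JSON object found in urlscan.io search response");
  }
  // Slice up to the last closing brace so trailing non-JSON text cannot break parsing.
  return JSON.parse(text.slice(start, end + 1));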
}

async function main() {
  console.error(`Fetching latest public scans from ${SEARCH_URL}`);
  const doc = await firecrawl.scrape(SEARCH_URL, {
    formats: ["rawHtml"],
    integration: "prometheus",
  });
  const rawHtml = doc.rawHtml;
  if (!rawHtml) {
    throw new Error("urlscan.io search response had no content");
  }
  const data = parseJsonBody(rawHtml);
  if (!Array.isArray(data.results)) {
    throw new Error("urlscan.io response is missing the 'results' array");
  }

  const urls = data.results
    .map((r) => ({
      url: r.task?.url ?? r.page?.url ?? "",
      domain: r.task?.domain ?? r.page?.domain ?? "",
      date: r.task?.time ?? "",
    }))
    .filter((r) => r.url !== "");

  if (urls.length === 0) {
    throw new Error("no scan results extracted from urlscan.io response");
  }
  console.error(`Extracted ${urls.length} newly scanned URLs`);

  const out = {
    source: "urlscan.io public scan feed",
    fetchedAt: new Date().toISOString(),
    count: urls.length,
    urls,
  };
  process.stdout.write(JSON.stringify(out));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
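
The mapping step in main() prefers the task fields, falls back to the page fields, and drops any record without a URL. The sketch below isolates that logic in a pure function so it can be tried without a Firecrawl key or network access. The helper name `toEntries`, the `Entry` type, and the three sample records are made up for illustration; they are not real urlscan.io data.

```typescript
interface UrlscanResult {
  task?: { url?: string; domain?: string; time?: string };
  page?: { url?: string; domain?: string };
}

interface Entry {
  url: string;
  domain: string;
  date: string;
}

// Hypothetical helper: same fallback rules as the map/filter step in main().
function toEntries(results: UrlscanResult[]): Entry[] {
  return results
    .map((r) => ({
      url: r.task?.url ?? r.page?.url ?? "",
      domain: r.task?.domain ?? r.page?.domain ?? "",
      date: r.task?.time ?? "",
    }))
    .filter((e) => e.url !== ""); // records with no URL in task or page are dropped
}

// Made-up records for illustration only.
const sampleResults: UrlscanResult[] = [
  { task: { url: "https://a.example/", domain: "a.example", time: "2026-01-01T00:00:00.000Z" } },
  { page: { url: "https://b.example/", domain: "b.example" } }, // no task: falls back to page
  { task: { domain: "c.example" } }, // no URL anywhere: filtered out
];

const entries = toEntries(sampleResults);
console.log(JSON.stringify(entries, null, 2));
```

Keeping this step separate from the network call makes the fallback behavior easy to unit-test; whether the published collector is structured this way internally is not stated in the listing.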
