Crawl - Screaming Frog API

Class constructors

All constructors are class methods on Crawl. Use Crawl.load() for auto-detected loading, or the named constructors for explicit control.

`Crawl.load()`

Auto-detect and load a crawl from any supported source.

from screamingfrog import Crawl

crawl = Crawl.load("./exports")          # CSV exports directory
crawl = Crawl.load("./crawl.db")         # SQLite database
crawl = Crawl.load("./crawl.duckdb")     # DuckDB analytics cache
crawl = Crawl.load("./crawl.dbseospider") # Derby-backed crawl (default: DuckDB analysis)
crawl = Crawl.load("./crawl.seospider")  # Screaming Frog crawl file
crawl = Crawl.load("<uuid>", source_type="db_id")  # DB crawl ID

path

str

required

Path to the crawl source. Accepts a directory, file path, or DB crawl UUID. Auto-detection is based on path suffix and directory contents.

source_type

str

default:"auto"

Force a specific loader. One of "auto", "exports", "csv", "duckdb", "sqlite", "db", "derby", "dbseospider", "seospider", "db_id".

seospider_backend

str

default:"duckdb"

Backend to use when loading .seospider files. One of "duckdb", "derby", "csv".

db_id_backend

str

default:"duckdb"

Backend to use when loading by DB crawl ID. One of "duckdb", "derby", "csv".

dbseospider_backend

str

default:"duckdb"

Backend to use when loading .dbseospider files. One of "duckdb", "derby".

duckdb_path

str | None

default:"None"

Path for the DuckDB analytics cache. Defaults to a sibling file next to the source.

duckdb_namespace

str | None

default:"None"

Namespace to use within a multi-crawl DuckDB file.

duckdb_tabs

Sequence[str] | str | None

default:"None"

Tabs to materialize into the DuckDB cache. Pass "all" to materialize every mapped tab.

duckdb_if_exists

str

default:"auto"

Cache refresh strategy. "auto" rebuilds only when the Derby source changed. Also accepts "replace" or "skip".

materialize_dbseospider

bool

default:"True"

Whether to create a .dbseospider sidecar file when loading .seospider crawls.

csv_fallback

bool

default:"True"

Enable automatic CSV export fallback for Derby-backed crawls when a tab or column is missing.

export_tabs

Sequence[str] | None

default:"None"

Tabs to export when using CLI-backed loaders.

export_profile

str | None

default:"None"

Named export profile. Use "kitchen_sink" for the bundled full-tab profile.

Returns Crawl
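
To make the "path suffix and directory contents" rule concrete, here is a minimal sketch of how such auto-detection could work. It is not the library's implementation: `guess_source_type` and `SUFFIX_TO_SOURCE_TYPE` are hypothetical names, and the table only mirrors the suffixes and `source_type` values listed above.

```python
from pathlib import Path

# Hypothetical suffix table; the values reuse the documented source_type names.
SUFFIX_TO_SOURCE_TYPE = {
    ".duckdb": "duckdb",
    ".db": "sqlite",
    ".sqlite": "sqlite",
    ".dbseospider": "dbseospider",
    ".seospider": "seospider",
}


def guess_source_type(path: str) -> str:
    """Guess a loader name from the path suffix, then from directory contents."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in SUFFIX_TO_SOURCE_TYPE:
        return SUFFIX_TO_SOURCE_TYPE[suffix]
    # Assumption: a directory holding at least one .csv file is an exports folder.
    if p.is_dir() and any(p.glob("*.csv")):
        return "exports"
    raise ValueError(f"cannot auto-detect crawl source for {path!r}")
```

When detection would be ambiguous, passing `source_type` explicitly (or calling a named constructor) avoids relying on any such heuristic.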

`Crawl.from_exports()`

Load from a directory of CSV export files.

crawl = Crawl.from_exports("./exports")

export_dir

str

required

Path to the directory containing exported .csv files.

Returns Crawl

`Crawl.from_database()`

Load from a SQLite database file (legacy backend, limited tab support).

crawl = Crawl.from_database("./crawl.db")

db_path

str

required

Path to the SQLite .db or .sqlite file.

Returns Crawl

`Crawl.from_duckdb()`

Load from a DuckDB analytics cache file.

crawl = Crawl.from_duckdb("./crawl.duckdb")
crawl = Crawl.from_duckdb("./portfolio.duckdb", namespace="client-a")

db_path

str

required

Path to the .duckdb file.

namespace

str | None

default:"None"

Namespace to read within a multi-crawl DuckDB file.

Returns Crawl

`Crawl.duckdb_namespaces()`

List all crawl namespaces stored in a DuckDB file.

namespaces = Crawl.duckdb_namespaces("./portfolio.duckdb")

db_path

str

required

Path to the .duckdb file.

Returns list[str]
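
A common follow-up is to open every crawl stored in a shared file. The sketch below combines `Crawl.duckdb_namespaces()` with `Crawl.from_duckdb()`, using only the call shapes shown above. `load_all_namespaces` is a hypothetical helper, and the class is passed in as an argument so the function can be exercised without the package installed.

```python
from typing import Any


def load_all_namespaces(crawl_cls: Any, db_path: str) -> dict[str, Any]:
    """Open one crawl per namespace stored in a shared DuckDB file.

    `crawl_cls` is expected to be screamingfrog.Crawl; it is a parameter
    only so this sketch has no import-time dependency on the package.
    """
    return {
        namespace: crawl_cls.from_duckdb(db_path, namespace=namespace)
        for namespace in crawl_cls.duckdb_namespaces(db_path)
    }
```

Typical use would be `crawls = load_all_namespaces(Crawl, "./portfolio.duckdb")`, followed by `crawls["client-a"]`.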

`Crawl.from_derby()`

Load directly from a Derby (.dbseospider) database.

crawl = Crawl.from_derby("./crawl.dbseospider")
crawl = Crawl.from_derby("./crawl.dbseospider", backend="derby", csv_fallback=False)

db_path

str

required

Path to the Derby database directory or .dbseospider archive.

backend

str

default:"duckdb"

Analysis backend. "duckdb" (default) promotes to a DuckDB analytics cache; "derby" queries Derby directly.

duckdb_path

str | None

default:"None"

Path for the DuckDB cache. Defaults to a sibling file next to the source.

duckdb_namespace

str | None

default:"None"

Namespace for the DuckDB cache.

duckdb_tables

Sequence[str] | None

default:"None"

Raw Derby tables to export into DuckDB.

duckdb_tabs

Sequence[str] | str | None

default:"None"

Mapped tabs to materialize into DuckDB. Use "all" for every available tab.

duckdb_if_exists

str

default:"auto"

Cache refresh strategy. "auto", "replace", or "skip".

csv_fallback

bool

default:"True"

Fall back to CLI CSV exports for tabs or columns unavailable in Derby.

Returns Crawl

`Crawl.from_seospider()`

Load from a Screaming Frog .seospider crawl file. Runs the Screaming Frog CLI internally.

crawl = Crawl.from_seospider("./crawl.seospider")
crawl = Crawl.from_seospider(
    "./crawl.seospider",
    backend="csv",
    export_dir="./exports",
    export_tabs=["Internal:All", "Response Codes:All"],
)

crawl_path

str

required

Path to the .seospider file.

backend

str

default:"duckdb"

Backend to use. One of "duckdb", "derby", "csv".

materialize_dbseospider

bool

default:"True"

Create a .dbseospider sidecar archive next to the source crawl.

dbseospider_overwrite

bool

default:"True"

Overwrite an existing .dbseospider sidecar.

ensure_db_mode

bool

default:"True"

Temporarily set storage.mode=DB in spider.config before loading.

export_tabs

Sequence[str] | None

default:"None"

Tabs to export when using the CSV backend.

export_profile

str | None

default:"None"

Named export profile (e.g. "kitchen_sink").

Returns Crawl

`Crawl.from_db_id()`

Load a DB-mode crawl by its UUID from the local ProjectInstanceData directory.

crawl = Crawl.from_db_id("138edb21-61d0-41cd-9e9b-725b592a471c")

crawl_id

str

required

The UUID of the DB-mode crawl folder inside ProjectInstanceData.

backend

str

default:"duckdb"

Backend to use. One of "duckdb", "derby", "csv".

project_root

str | None

default:"None"

Override the ProjectInstanceData root directory. Defaults to the standard Screaming Frog data path.

Returns Crawl

Views and queries

`crawl.internal`

Sitewide internal page view. Returns an InternalView object backed by the internal page model.

for page in crawl.internal.filter(status_code=404):
    print(page.address)

Type InternalView

`crawl.pages()`

Sitewide page view backed by the internal model. Use .filter() and .select() to narrow results.

pages = crawl.pages().filter(status_code=404).collect()
lightweight = crawl.pages().select("Address", "Status Code", "Title 1").collect()

Returns PageView

`crawl.links()`

Sitewide inlinks or outlinks view.

inlinks = crawl.links("in").filter(status_code=404).collect()
outlinks = crawl.links("out").collect()

direction

str

default:"out"

Link direction. "in" for inlinks, "out" for outlinks.

Returns LinkView

`crawl.tab()`

Access any export tab by name (CSV filename without extension, or normalized name).

for row in crawl.tab("response_codes_all"):
    print(row["Address"], row["Status Code"])

for row in crawl.tab("page_titles").filter(gui="Missing"):
    print(row["Address"])

name

str

required

Tab name. Case-insensitive; snake_case and title-case forms accepted. Extension optional.

Returns TabView
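
The name matching described for `name` can be pictured as a normalization step applied before lookup. The following is a sketch under assumptions, not the library's actual rule: `normalize_tab_name` is a hypothetical helper that lowercases the name, drops an optional `.csv` extension, and collapses separators into underscores.

```python
import re


def normalize_tab_name(name: str) -> str:
    """Reduce a tab name to a lowercase snake_case key (hypothetical rule)."""
    stem = name.strip()
    if stem.lower().endswith(".csv"):
        stem = stem[:-4]  # the extension is optional
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", stem.lower()).strip("_")
```

Under this rule, "Page Titles", "page_titles", and "PAGE_TITLES.CSV" all resolve to the same key.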

`crawl.section()`

Scope page and link views to a URL path prefix or full URL prefix.

blog = crawl.section("/blog")
blog_pages = blog.pages().collect()
blog_inlinks = blog.links("in").collect()
blog_tab = blog.tab("all_inlinks").collect()

prefix

str

required

URL path prefix (e.g. "/blog") or full URL prefix (e.g. "https://example.com/blog").

Returns CrawlSection
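
The two accepted prefix forms imply two slightly different comparisons. Here is a minimal sketch of that distinction, assuming plain string-prefix matching; `in_section` is a hypothetical helper, and the library may apply stricter rules (for example, at path-segment boundaries).

```python
from urllib.parse import urlsplit


def in_section(url: str, prefix: str) -> bool:
    """Return True when `url` falls under `prefix` (hypothetical matching rule)."""
    if prefix.startswith("/"):
        # Path prefix: compare against the URL's path component only.
        return urlsplit(url).path.startswith(prefix)
    # Full URL prefix: compare against the whole URL string.
    return url.startswith(prefix)
```

Note that plain prefix matching would also treat "/blogger" as part of "/blog"; a trailing slash in the prefix is one way to avoid that in this sketch.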

`crawl.search()`

Search across the sitewide page view.

matches = crawl.search("canonical", fields=["Address", "Title 1"]).collect()

term

str

required

Search string.

fields

Sequence[str] | None

default:"None"

Limit search to these column names. Searches all string fields when None.

case_sensitive

bool

default:"False"

Whether the search is case-sensitive.

Returns SearchRowView
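
To illustrate how `fields` and `case_sensitive` interact, the sketch below runs the same kind of search over plain dict rows. It assumes substring matching, which this page does not confirm, and `search_rows` is a hypothetical stand-in rather than the library's code.

```python
from typing import Any, Iterable, Iterator, Optional, Sequence


def search_rows(
    rows: Iterable[dict[str, Any]],
    term: str,
    fields: Optional[Sequence[str]] = None,
    case_sensitive: bool = False,
) -> Iterator[dict[str, Any]]:
    """Yield rows where `term` occurs as a substring of a searched field."""
    needle = term if case_sensitive else term.lower()
    for row in rows:
        # With fields=None, search every column that holds a string.
        if fields is not None:
            names = list(fields)
        else:
            names = [key for key, value in row.items() if isinstance(value, str)]
        for name in names:
            value = row.get(name)
            if not isinstance(value, str):
                continue  # skip missing or non-string values
            haystack = value if case_sensitive else value.lower()
            if needle in haystack:
                yield row
                break  # one match per row is enough
```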

`crawl.tabs`

List available tab names for the current backend.

print(crawl.tabs)

Type list[str]

`crawl.query()`

Build a chainable SQL query against a raw backend table (DB-backed crawls only).

rows = (
    crawl.query("APP", "URLS")
    .select("ENCODED_URL", "RESPONSE_CODE")
    .where("RESPONSE_CODE >= ?", 400)
    .order_by("RESPONSE_CODE DESC")
    .limit(100)
    .collect()
)

schema

str

required

Schema name (e.g. "APP").

table

str

required

Table name (e.g. "URLS").

Returns QueryView

`crawl.raw()`

Yield raw rows from a backend table as dicts. DB-backed crawls only.

for row in crawl.raw("APP.URLS"):
    print(row["ENCODED_URL"], row["RESPONSE_CODE"])

table

str

required

Fully qualified table name (e.g. "APP.URLS").

Returns Iterator[dict[str, Any]]

`crawl.sql()`

Execute a raw SQL query and yield rows as dicts. DB-backed crawls only.

for row in crawl.sql(
    "SELECT ENCODED_URL, RESPONSE_CODE FROM APP.URLS WHERE RESPONSE_CODE >= ?",
    [400],
):
    print(row)

query

str

required

SQL query string. Use ? for parameterized values.

params

Sequence[Any] | None

default:"None"

Query parameters corresponding to ? placeholders.

Returns Iterator[dict[str, Any]]
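
For readers unfamiliar with `?` placeholders, this self-contained sketch shows the same pattern against a stand-in database. It uses Python's built-in `sqlite3` module with an attached schema named APP so that `APP.URLS` resolves; the real backend is Derby or DuckDB, and `run_sql` is a hypothetical function that only imitates the documented call shape.

```python
import sqlite3
from typing import Any, Iterator, Optional, Sequence

# Stand-in database: SQLite with an attached schema named APP.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("ATTACH DATABASE ':memory:' AS APP")
conn.execute("CREATE TABLE APP.URLS (ENCODED_URL TEXT, RESPONSE_CODE INTEGER)")
conn.executemany(
    "INSERT INTO APP.URLS VALUES (?, ?)",
    [
        ("https://example.com/", 200),
        ("https://example.com/missing", 404),
        ("https://example.com/error", 500),
    ],
)


def run_sql(query: str, params: Optional[Sequence[Any]] = None) -> Iterator[dict[str, Any]]:
    """Bind `params` to the ? placeholders in order and yield each row as a dict."""
    for row in conn.execute(query, tuple(params or ())):
        yield dict(row)


rows = list(
    run_sql(
        "SELECT ENCODED_URL, RESPONSE_CODE FROM APP.URLS "
        "WHERE RESPONSE_CODE >= ? ORDER BY RESPONSE_CODE",
        [400],
    )
)
```

Binding values through `params` rather than formatting them into the query string keeps user-supplied input from being interpreted as SQL.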

Graph helpers

`crawl.inlinks()`

Return all inlinks for a given URL.

for link in crawl.inlinks("https://example.com/page"):
    print(link.source, link.anchor_text)

url

str

required

The destination URL to look up inlinks for.

Returns Iterator[Link]

`crawl.outlinks()`

Return all outlinks from a given URL.

for link in crawl.outlinks("https://example.com/page"):
    print(link.destination)

url

str

required

The source URL to look up outlinks for.

Returns Iterator[Link]

Chain helpers

`crawl.redirect_chains()`

Iterate redirect chain rows, optionally filtered by hop count and loop flag.

for row in crawl.redirect_chains(min_hops=3, loop=False):
    print(row["Address"], row["Number of Redirects"])

min_hops

int | None

default:"None"

Minimum number of redirect hops. None means no lower bound.

max_hops

int | None

default:"None"

Maximum number of redirect hops. None means no upper bound.

loop

bool | None

default:"None"

Filter by loop status. True returns only loops; False excludes loops; None returns all.

Returns Iterator[dict[str, Any]]
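
The three filters combine as independent conditions, each skipped when left as None. A minimal sketch of those semantics over plain dict rows follows. It assumes inclusive hop bounds and a boolean column named "Loop"; that column name is invented for illustration, while "Number of Redirects" comes from the example above.

```python
from typing import Any, Iterable, Iterator, Optional


def filter_chains(
    rows: Iterable[dict[str, Any]],
    min_hops: Optional[int] = None,
    max_hops: Optional[int] = None,
    loop: Optional[bool] = None,
) -> Iterator[dict[str, Any]]:
    """Yield chain rows that satisfy the hop bounds and the loop flag."""
    for row in rows:
        hops = int(row["Number of Redirects"])
        if min_hops is not None and hops < min_hops:
            continue  # below the lower bound
        if max_hops is not None and hops > max_hops:
            continue  # above the upper bound
        # "Loop" is an assumed column name used only in this sketch.
        if loop is not None and bool(row["Loop"]) != loop:
            continue
        yield row
```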

`crawl.canonical_chains()`

Iterate canonical chain rows.

min_hops

int | None

default:"None"

Minimum number of canonical hops.

max_hops

int | None

default:"None"

Maximum number of canonical hops.

loop

bool | None

default:"None"

Filter by loop status.

Returns Iterator[dict[str, Any]]

`crawl.redirect_and_canonical_chains()`

Iterate mixed redirect and canonical chain rows.

min_hops

int | None

default:"None"

Minimum total hops.

max_hops

int | None

default:"None"

Maximum total hops.

loop

bool | None

default:"None"

Filter by loop status.

Returns Iterator[dict[str, Any]]

Audit report helpers

Except for crawl.summary(), which returns a single dict, the report helpers return a flat list[dict[str, Any]] of issue rows, ready to export or load into a dataframe.
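
Because report rows are plain dicts, exporting them needs only the standard library. The sketch below writes rows to CSV text and tolerates rows whose keys differ, which can happen when a report merges several tabs. `rows_to_csv` is a hypothetical helper, not part of the package.

```python
import csv
import io
from typing import Any


def rows_to_csv(rows: list[dict[str, Any]]) -> str:
    """Serialize report rows to CSV text, tolerating rows with different keys."""
    # Union of keys in first-seen order, so columns stay stable across rows.
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()
```

The same list can usually be handed to a dataframe constructor directly, since most dataframe libraries accept a list of dicts.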

`crawl.summary()`

Return a compact crawl-level summary dict with counts for pages, broken links, orphans, redirect chains, and issue families.

print(crawl.summary())

Core counts (pages, tabs, broken_pages) are always populated. Issue-family and chain totals may be None on lean DuckDB caches until those tabs are materialized.

Returns dict[str, Any]

`crawl.broken_links_report()`

Return broken internal URLs with inlink counts and sampled inlink sources.

min_status

int

default:"400"

Minimum HTTP status code to include.

max_status

int

default:"599"

Maximum HTTP status code to include.

max_inlinks

int | None

default:"25"

Maximum number of sampled inlink sources per broken URL. Pass None to include all.

Returns list[dict[str, Any]]
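
The shape of this report, a status-range filter plus a capped sample of inlink sources, can be sketched with plain data structures. Everything here is illustrative: `broken_links` is a hypothetical function, and the output keys "Inlinks" and "Inlink Sources" are assumed names, not the library's documented columns.

```python
from typing import Any, Optional


def broken_links(
    pages: list[dict[str, Any]],
    links: list[tuple[str, str]],
    min_status: int = 400,
    max_status: int = 599,
    max_inlinks: Optional[int] = 25,
) -> list[dict[str, Any]]:
    """List pages in the status range with an inlink count and sampled sources."""
    sources_by_destination: dict[str, list[str]] = {}
    for source, destination in links:
        sources_by_destination.setdefault(destination, []).append(source)

    report = []
    for page in pages:
        if not min_status <= page["Status Code"] <= max_status:
            continue
        sources = sources_by_destination.get(page["Address"], [])
        report.append(
            {
                "Address": page["Address"],
                "Status Code": page["Status Code"],
                "Inlinks": len(sources),
                # Slicing with None keeps every source; an int caps the sample.
                "Inlink Sources": sources[:max_inlinks],
            }
        )
    return report
```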

`crawl.broken_inlinks_report()`

Return sitewide inlinks pointing to broken destinations.

min_status

int

default:"400"

Minimum HTTP status code.

max_status

int

default:"599"

Maximum HTTP status code.

Returns list[dict[str, Any]]

`crawl.nofollow_inlinks_report()`

Return sitewide inlinks marked as nofollow.

Returns list[dict[str, Any]]

`crawl.title_meta_audit()`

Return page-level rows for missing titles and missing meta descriptions.

Returns list[dict[str, Any]]

`crawl.indexability_audit()`

Return non-indexable pages with key indexability fields (Indexability, Indexability Status, Canonical, Meta Robots, X-Robots-Tag).

Returns list[dict[str, Any]]

`crawl.orphan_pages_report()`

Return pages with no incoming internal links.

ignore_self_links

bool

default:"True"

Exclude self-referencing links when computing inlink counts.

only_indexable

bool

default:"False"

Return only indexable orphan pages.

Returns list[dict[str, Any]]
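
The effect of `ignore_self_links` is easiest to see on a small link graph. This is a minimal sketch of the orphan definition given above, using plain URL strings and (source, destination) pairs; `orphan_pages` is a hypothetical function and does not model `only_indexable`.

```python
from typing import Iterable


def orphan_pages(
    pages: Iterable[str],
    links: Iterable[tuple[str, str]],
    ignore_self_links: bool = True,
) -> list[str]:
    """Return pages that are never the destination of an internal link."""
    linked_to: set[str] = set()
    for source, destination in links:
        if ignore_self_links and source == destination:
            continue  # a page linking to itself does not count as an inlink
        linked_to.add(destination)
    return [page for page in pages if page not in linked_to]
```

With `ignore_self_links=True`, a page whose only inlink comes from itself is still reported as an orphan.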

`crawl.security_issues_report()`

Return rows from all available security issue tabs (missing HSTS, CSP, mixed content, insecure forms, etc.).

Returns list[dict[str, Any]]

`crawl.canonical_issues_report()`

Return rows from all available canonical issue tabs (missing, multiple, conflicting, non-indexable, etc.).

Returns list[dict[str, Any]]

`crawl.hreflang_issues_report()`

Return rows from all available hreflang issue tabs.

Returns list[dict[str, Any]]

`crawl.redirect_issues_report()`

Return rows from available redirect issue tabs (redirect chains, loops, meta refresh, JS redirect).

Returns list[dict[str, Any]]

`crawl.redirect_chain_report()`

Eager version of crawl.redirect_chains(): accepts the same filters and returns the results as a list instead of an iterator.

min_hops

int | None

default:"None"

Minimum redirect hops.

max_hops

int | None

default:"None"

Maximum redirect hops.

loop

bool | None

default:"None"

Filter by loop status.

Returns list[dict[str, Any]]

Tab metadata

`crawl.tab_filters()`

List available GUI filter names for a tab.

print(crawl.tab_filters("Page Titles"))
# ['Missing', 'Duplicate', 'Over 60 Characters', ...]

name

str

required

Tab name.

Returns list[str]

`crawl.tab_filter_defs()`

Return the full filter definition objects for a tab.

name

str

required

Tab name.

Returns list[Any]

`crawl.tab_columns()`

Return the column names for a tab.

print(crawl.tab_columns("page_titles"))

name

str

required

Tab name.

Returns list[str]

`crawl.describe_tab()`

Return a dict with tab, columns, and filters for a given tab name.

info = crawl.describe_tab("page_titles")
print(info["columns"], info["filters"])

name

str

required

Tab name.

Returns dict[str, Any]

DuckDB export

`crawl.export_duckdb()`

Export the current crawl into a DuckDB analytics cache file.

crawl.export_duckdb("./crawl.duckdb", if_exists="auto")

# Export into a shared portfolio file with a namespace
crawl.export_duckdb("./portfolio.duckdb", namespace="client-a", if_exists="auto")

# Materialize all mapped tabs
crawl.export_duckdb("./crawl.duckdb", tabs="all")

path

str

required

Destination path for the DuckDB file.

tables

Sequence[str] | None

default:"None"

Raw Derby tables to include.

tabs

Sequence[str] | str | None

default:"None"

Mapped tabs to materialize. Pass "all" for every available tab.

if_exists

str

default:"replace"

What to do when the cache already exists. One of "replace", "skip", "auto".

source_label

str | None

default:"None"

Label stored in the cache to identify the crawl source.

namespace

str | None

default:"None"

Namespace within the DuckDB file for multi-crawl storage.

Returns Path
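
The three `if_exists` values can be read as a small decision table. The sketch below encodes one plausible reading: "replace" always rewrites, "skip" keeps an existing cache, and "auto" rewrites only when the source changed. The "skip" behavior and the `should_rebuild` helper are assumptions made for illustration; only the "auto" behavior is described on this page.

```python
def should_rebuild(if_exists: str, cache_exists: bool, source_changed: bool) -> bool:
    """Decide whether to (re)write the cache under an assumed reading of if_exists."""
    if not cache_exists:
        return True  # nothing to reuse, so always build
    if if_exists == "replace":
        return True
    if if_exists == "skip":
        return False  # assumption: keep whatever cache is already there
    if if_exists == "auto":
        return source_changed  # rebuild only when the source changed
    raise ValueError(f"unknown if_exists value: {if_exists!r}")
```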

`export_duckdb_from_backend()`

Export a crawl backend directly to a DuckDB file (lower-level than crawl.export_duckdb()). Used internally; exposed for advanced workflows.

from screamingfrog import export_duckdb_from_backend

backend

CrawlBackend

required

A crawl backend instance.

duckdb_path

str | Path

required

Destination path for the DuckDB file.

tables

Sequence[str] | None

default:"None"

Raw Derby tables to export. Defaults to DEFAULT_DUCKDB_TABLES.

tabs

Sequence[str] | str | None

default:"None"

Mapped tabs to materialize.

if_exists

str

default:"replace"

Cache refresh strategy: "replace", "skip", or "auto".

source_label

str | None

default:"None"

Label stored in the cache to identify the crawl source.

namespace

str | None

default:"None"

Namespace within the DuckDB file.

Returns Path

Exported constants

`DEFAULT_DUCKDB_TABLES`

The default set of raw Derby tables exported when creating a DuckDB cache without specifying tables.

from screamingfrog import DEFAULT_DUCKDB_TABLES

print(DEFAULT_DUCKDB_TABLES)
# ('APP.URLS', 'APP.LINKS', 'APP.UNIQUE_URLS')

Type tuple[str, ...]

`DEFAULT_DUCKDB_TABS`

The default set of mapped tabs materialized when creating a DuckDB cache without specifying tabs.

from screamingfrog import DEFAULT_DUCKDB_TABS

print(DEFAULT_DUCKDB_TABS)
# ('internal_all', 'all_inlinks', 'all_outlinks', 'redirect_chains',
#  'canonical_chains', 'redirect_and_canonical_chains')

Type tuple[str, ...]

Crawl comparison

`crawl.compare()`

Compare two crawls and return structural changes as a CrawlDiff.

old = Crawl.load("./crawl-2024-01.dbseospider")
new = Crawl.load("./crawl-2024-02.dbseospider")

diff = new.compare(old)
print(diff.summary())

for change in diff.status_changes:
    print(change.url, change.old_status, "->", change.new_status)

other

Crawl

required

The baseline crawl to compare against.

title_fields

Sequence[str] | None

default:"None"

Field names to use for title comparison. Defaults to ("Title 1", "Title").

redirect_fields

Sequence[str] | None

default:"None"

Field names for redirect URL comparison. Defaults to ("Redirect URL", "Redirect URI", "Redirect Destination").

redirect_type_fields

Sequence[str] | None

default:"None"

Field names for redirect type comparison. Defaults to ("Redirect Type",).

field_groups

dict[str, Sequence[str]] | None

default:"None"

Additional field groups to diff (canonical, meta description, H1-3, word count, indexability, robots directives). Pass a custom dict to override the defaults.

Returns CrawlDiff
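
Conceptually, each entry in `diff.status_changes` pairs a URL with its old and new status. A minimal sketch of that idea over two URL-to-status mappings is shown here; `StatusChange` and `status_changes` are hypothetical stand-ins whose attribute names follow the example above, not the library's classes.

```python
from typing import NamedTuple


class StatusChange(NamedTuple):
    url: str
    old_status: int
    new_status: int


def status_changes(old: dict[str, int], new: dict[str, int]) -> list[StatusChange]:
    """Report URLs present in both crawls whose status code differs."""
    return [
        StatusChange(url, old[url], new[url])
        for url in sorted(old.keys() & new.keys())
        if old[url] != new[url]
    ]
```

URLs that appear in only one crawl are ignored in this sketch, since they represent added or removed pages rather than status changes.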

Top-level helpers

`list_crawls()`

Enumerate all DB-mode crawls in the local ProjectInstanceData directory without opening Derby.

from screamingfrog import Crawl, list_crawls

for info in list_crawls():
    print(info.db_id, info.url, info.urls_crawled, info.modified)

latest = list_crawls()[0]
crawl = Crawl.load(latest.db_id, source_type="db_id")

project_root

str | None

default:"None"

Override the ProjectInstanceData root directory.

Returns list[CrawlInfo]
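
This page does not state the order in which `list_crawls()` returns entries. If you would rather not depend on it, the most recent crawl can be selected explicitly by its `modified` attribute. `most_recent` is a hypothetical helper and assumes `modified` values are mutually comparable (for example, datetimes or timestamps).

```python
from typing import Any, Iterable


def most_recent(crawls: Iterable[Any]) -> Any:
    """Return the entry with the greatest `modified` value."""
    return max(crawls, key=lambda info: info.modified)
```

Typical use would be `latest = most_recent(list_crawls())`, followed by `Crawl.load(latest.db_id, source_type="db_id")`. Like `max()`, it raises ValueError on an empty list.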

`export_duckdb_from_derby()`

Export a Derby crawl to a DuckDB file directly (without creating a Crawl instance).

from screamingfrog import export_duckdb_from_derby

export_duckdb_from_derby("./crawl.dbseospider", "./crawl.duckdb", if_exists="auto")

db_path

str

required

Path to the Derby database directory or .dbseospider file.

duckdb_path

str

required

Destination path for the DuckDB file.

tables

Sequence[str] | None

default:"None"

Raw Derby tables to export.

tabs

Sequence[str] | None

default:"None"

Mapped tabs to materialize.

if_exists

str

default:"auto"

Cache refresh strategy.

Returns Path

`export_duckdb_from_db_id()`

Export a DB-mode crawl by ID to a DuckDB file.

from screamingfrog import export_duckdb_from_db_id

export_duckdb_from_db_id(
    "138edb21-61d0-41cd-9e9b-725b592a471c",
    "./crawl.duckdb",
    if_exists="auto",
)

db_id

str

required

The DB crawl UUID.

duckdb_path

str

required

Destination path for the DuckDB file.

tables

Sequence[str] | None

default:"None"

Raw Derby tables to export.

tabs

Sequence[str] | None

default:"None"

Mapped tabs to materialize.

if_exists

str

default:"auto"

Cache refresh strategy.

Returns Path
