Skip to content

Scan depth

The depth parameter controls how much metadata is extracted from files, folders, datasets, and databases. Set it once as a global default, or override it for any add entry.

yaml
depth: variable                   # global default

add:
  - folder: ./data                # inherits "variable"

  - folder: ./big
    depth: stat                   # override for this entry

  - database: sqlite:///db.sqlite
    depth: dataset
Featuredatasetvariablestatvalue (default)
Folders
Datasets (format, path, mtime)
Variables (names, types)
DB introspection (PK, FK, comments)
Row count, statistics
Enumerations, frequencies, patterns
Auto-tagging (format, security, text)

Note: At depth="variable", CSV and Excel files only extract column names (types require reading data, available from depth="stat"). All other formats provide types at this level.

Typical use cases:

  • dataset — quick inventory of available files/tables without reading data
  • variable — lightweight schema discovery (column names and types)
  • stat — data profiling without enumeration detection (faster than value)
  • value — full catalog with frequency tables and enumeration assignment (default)

Auto-tagging

At depth="value" (default), string columns are automatically tagged by content type. Tags use a two-level hierarchy under the auto parent:

CategoryTags
Formatauto---email, auto---phone, auto---uuid, auto---iban
Securityauto---bcrypt, auto---argon2, auto---jwt, auto---secret
Textauto---structured, auto---semi-structured, auto---free-textauto---natural-text

Each variable receives at most one leaf tag. The frontend can use parent_id to filter by category (e.g. selecting auto---security shows all bcrypt/argon2/jwt/secret variables).

Security-tagged columns (bcrypt, argon2, jwt, secret) have their raw frequency values suppressed — only pattern frequencies are emitted, so no actual secrets appear in the exported catalog.

Policy tags

Policy tags let you manually control scan behavior for specific variables. Like auto-tags, they live under the scan hierarchy and are auto-created — no tag.csv entry needed.

TagEffect
policy---frequency-hiddenSuppress all frequency and enumeration data (stats remain visible)

Assign the tag in your variable.csv metadata:

csv
id,tag_ids
source---employees_csv---first_name,"policy---frequency-hidden"

This is useful for sensitive columns that auto-tagging does not flag (e.g. first names, internal IDs, business codes).