
# Python API

## Catalog

```python
Catalog(
    *,
    app_path=None,
    metadata_path=None,
    depth="value",
    refresh=False,
    freq_threshold=100,
    csv_encoding=None,
    sample_size=100_000,
    preview_rows=100,
    csv_skip_copy=False,
    app_config=None,
    quiet=False,
    verbose=False,
    log_file=None,
)
```
| Attribute | Type | Description |
| --- | --- | --- |
| `app_path` | `str \| Path \| None` | Load an existing catalog for incremental scans |
| `metadata_path` | `str \| Path \| list[str \| Path] \| None` | Metadata source folder, database URI, or list of sources |
| `depth` | `"dataset" \| "variable" \| "stat" \| "value"` | Default scan depth (default: `"value"`) |
| `refresh` | `bool` | Force a full rescan, ignoring the cache (default: `False`) |
| `freq_threshold` | `int` | Max distinct values for frequency/enumeration detection; string columns above this threshold get pattern frequencies instead |
| `csv_encoding` | `str \| None` | Default CSV encoding (`utf-8`, `cp1252`, etc.) |
| `sample_size` | `int \| None` | Default sample size for frequency/enumeration detection (default: `100_000`) |
| `preview_rows` | `int \| Literal[False]` | Default max rows exported in dataset previews at stat/value depth (default: `100`; `0` or `False` disables) |
| `csv_skip_copy` | `bool` | Skip the UTF-8 temp copy for local CSV files (default: `False`) |
| `app_config` | `dict[str, str] \| None` | Key-value config for the web app |
| `quiet` | `bool` | Suppress progress logging (default: `False`) |
| `verbose` | `bool` | Show full tracebacks on errors (default: `False`) |
| `log_file` | `str \| Path \| None` | Write the full scan log to a file (truncated each run) |
| `folder` | `Table[Folder]` | Folder table (`.all()`, `.count`, `.get_by(...)`) |
| `dataset` | `Table[Dataset]` | Dataset table |
| `variable` | `Table[Variable]` | Variable table |
| `enumeration` | `Table[Enumeration]` | Enumeration table |
| `value` | `Table[Value]` | Enumeration value table |
| `frequency` | `Table[Frequency]` | Frequency table (computed) |
| `organization` | `Table[Organization]` | Organization table |
| `tag` | `Table[Tag]` | Tag table |
| `doc` | `Table[Doc]` | Document table |
| `concept` | `Table[Concept]` | Business glossary concept table |
| `config` | `Table[Config]` | Web app config key-value table |
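A sketch of typical usage, assuming `datannurpy` is installed and `./data` contains scannable files (paths and names are illustrative):

```python
from datannurpy import Catalog

# Create a catalog and scan a local folder at the default "value" depth
catalog = Catalog(quiet=True)
catalog.add_folder("./data")

# Table accessors expose the scanned entities
print(catalog.dataset.count)
for variable in catalog.variable.all():
    print(variable)
```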

## Catalog.add_folder()

```python
catalog.add_folder(
    path,
    metadata=None,
    *,
    depth=None,
    include=None,
    exclude=None,
    recursive=True,
    csv_encoding=None,
    sample_size=None,
    preview_rows=None,
    csv_skip_copy=None,
    storage_options=None,
    refresh=None,
    quiet=None,
    time_series=True,
    create_folders=True,
    on_unmatched="warn",
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `path` | `str \| Path \| list[str \| Path]` | required | Directory or list of directories to scan |
| `metadata` | `EntityMetadata \| None` | `None` | Identity, parent linkage, and metadata for the root folder |
| `depth` | `"dataset" \| "variable" \| "stat" \| "value" \| None` | `None` | Scan depth (uses `catalog.depth` if `None`) |
| `include` | `list[str] \| None` | `None` | Glob patterns to include |
| `exclude` | `list[str] \| None` | `None` | Glob patterns to exclude |
| `recursive` | `bool` | `True` | Scan subdirectories |
| `csv_encoding` | `str \| None` | `None` | Override CSV encoding |
| `sample_size` | `int \| None` | `None` | Sample rows for frequency/enumeration detection (overrides catalog) |
| `preview_rows` | `int \| Literal[False] \| None` | `None` | Max preview rows for datasets found in this folder (overrides catalog; `0` or `False` disables) |
| `csv_skip_copy` | `bool \| None` | `None` | Skip the UTF-8 temp copy (overrides catalog) |
| `storage_options` | `dict \| None` | `None` | Options for remote storage (passed to fsspec) |
| `refresh` | `bool \| None` | `None` | Force a rescan (overrides catalog setting) |
| `quiet` | `bool \| None` | `None` | Override the catalog quiet setting |
| `time_series` | `bool` | `True` | Group files with temporal patterns |
| `create_folders` | `bool` | `True` | If `False`, do not create folders from disk; rely on `metadata_path` for structure (metadata-first) |
| `on_unmatched` | `"skip" \| "warn" \| "error"` | `"warn"` | Policy when a scanned file has no metadata match (only applies when `create_folders=False`) |
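A usage sketch with glob filters (directory and patterns are illustrative):

```python
from datannurpy import Catalog

catalog = Catalog()

# Scan only CSV and Parquet files, skipping temp directories,
# at a lighter depth than the catalog default
catalog.add_folder(
    "./data",
    include=["*.csv", "*.parquet"],
    exclude=["tmp/*"],
    depth="variable",
)
```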

## Catalog.add_dataset()

```python
catalog.add_dataset(
    path,
    *,
    metadata=None,
    depth=None,
    csv_encoding=None,
    sample_size=None,
    preview_rows=None,
    csv_skip_copy=None,
    storage_options=None,
    refresh=None,
    quiet=None,
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `path` | `str \| Path \| list[str \| Path]` | required | File(s) or partitioned directory (local/remote) |
| `metadata` | `EntityMetadata \| None` | `None` | Dataset identity, parent linkage, and metadata |
| `depth` | `"dataset" \| "variable" \| "stat" \| "value" \| None` | `None` | Scan depth (uses `catalog.depth` if `None`) |
| `csv_encoding` | `str \| None` | `None` | Override CSV encoding |
| `sample_size` | `int \| None` | `None` | Sample rows for frequency/enumeration detection (overrides catalog) |
| `preview_rows` | `int \| Literal[False] \| None` | `None` | Max preview rows for this dataset (overrides catalog; `0` or `False` disables) |
| `csv_skip_copy` | `bool \| None` | `None` | Skip the UTF-8 temp copy (overrides catalog) |
| `storage_options` | `dict \| None` | `None` | Options for remote storage (passed to fsspec) |
| `refresh` | `bool \| None` | `None` | Force a rescan (overrides catalog setting) |
| `quiet` | `bool \| None` | `None` | Override the catalog quiet setting |
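A usage sketch (file paths are illustrative):

```python
from datannurpy import Catalog

catalog = Catalog()

# Scan a single file, capping its exported preview at 50 rows
catalog.add_dataset("./data/sales.csv", preview_rows=50)

# A list of files (e.g. yearly slices of one dataset) also works
catalog.add_dataset(["./data/2023.parquet", "./data/2024.parquet"])
```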

## Catalog.add_database()

```python
catalog.add_database(
    connection,
    metadata=None,
    *,
    depth=None,
    schema=None,
    include=None,
    exclude=None,
    sample_size=None,
    preview_rows=None,
    group_by_prefix=True,
    prefix_min_tables=2,
    time_series=True,
    storage_options=None,
    refresh=None,
    quiet=None,
    oracle_client_path=None,
    ssh_tunnel=None,
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `connection` | `str \| ibis.BaseBackend` | required | Connection string or ibis backend object |
| `metadata` | `EntityMetadata \| None` | `None` | Identity, parent linkage, and metadata for the root folder |
| `depth` | `"dataset" \| "variable" \| "stat" \| "value" \| None` | `None` | Scan depth (uses `catalog.depth` if `None`) |
| `schema` | `str \| list[str] \| None` | `None` | Schema(s) to scan |
| `include` | `list[str] \| None` | `None` | Glob patterns matched against table names to include |
| `exclude` | `list[str] \| None` | `None` | Glob patterns matched against table names to exclude |
| `sample_size` | `int \| None` | `None` | Sample rows for frequency/enumeration detection (overrides catalog) |
| `preview_rows` | `int \| Literal[False] \| None` | `None` | Max preview rows for scanned table datasets (overrides catalog; `0` or `False` disables) |
| `group_by_prefix` | `bool \| str` | `True` | Group tables by prefix into subfolders |
| `prefix_min_tables` | `int` | `2` | Min tables to form a prefix group |
| `time_series` | `bool` | `True` | Detect temporal table patterns |
| `storage_options` | `dict \| None` | `None` | Options for remote SQLite/GeoPackage |
| `refresh` | `bool \| None` | `None` | Force a rescan (overrides catalog setting) |
| `quiet` | `bool \| None` | `None` | Override the catalog quiet setting |
| `oracle_client_path` | `str \| None` | `None` | Path to Oracle Instant Client libraries |
| `ssh_tunnel` | `dict \| None` | `None` | SSH tunnel config (host, user, port, etc.) |
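A usage sketch (the connection URI, schema, and table patterns are illustrative):

```python
from datannurpy import Catalog

catalog = Catalog()

# Scan one schema of a PostgreSQL database, skipping scratch tables
catalog.add_database(
    "postgresql://user:password@localhost:5432/warehouse",
    schema="public",
    exclude=["tmp_*"],
)
```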

## Catalog.export_db()

```python
catalog.export_db(
    output_dir=None,
    *,
    track_evolution=True,
    copy_assets=None,
    base_dir=None,
    quiet=None,
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `output_dir` | `str \| Path \| None` | `None` | Output directory (uses `app_path` if `None`) |
| `track_evolution` | `bool` | `True` | Track changes between exports |
| `copy_assets` | `dict \| list[dict] \| None` | `None` | Copy extra local files/directories into the export using the same rules as `copy_assets()` |
| `base_dir` | `str \| Path \| None` | `None` | Base directory for relative `copy_assets.from` paths (defaults to the current working directory) |
| `quiet` | `bool \| None` | `None` | Override the catalog quiet setting |

Exports JSON metadata files. Calls `finalize()` automatically when data has been scanned.
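A usage sketch (paths are illustrative):

```python
from datannurpy import Catalog

# With app_path set, export_db() writes into the existing app by default
catalog = Catalog(app_path="./app")
catalog.add_folder("./data")

# Writes the JSON metadata files; finalize() runs automatically
catalog.export_db()
```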

## Catalog.export_app()

```python
catalog.export_app(
    output_dir=None,
    *,
    open_browser=False,
    track_evolution=True,
    update_app=False,
    copy_assets=None,
    base_dir=None,
    quiet=None,
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `output_dir` | `str \| Path \| None` | `None` | Output directory (uses `app_path` if `None`) |
| `open_browser` | `bool` | `False` | Open the app in a browser after export |
| `track_evolution` | `bool` | `True` | Track changes between exports |
| `update_app` | `bool` | `False` | Refresh the bundled front-end app files when the app already exists |
| `copy_assets` | `dict \| list[dict] \| None` | `None` | Copy extra local files/directories into the exported app using the same rules as `copy_assets()` |
| `base_dir` | `str \| Path \| None` | `None` | Base directory for relative `copy_assets.from` paths (defaults to the current working directory) |
| `quiet` | `bool \| None` | `None` | Override the catalog quiet setting |

Exports a complete standalone datannur app with data. Uses `app_path` by default if set at init. Existing apps update data/db by default; pass `update_app=True` to also refresh the bundled front-end files.
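A usage sketch (paths are illustrative):

```python
from datannurpy import Catalog

catalog = Catalog()
catalog.add_folder("./data")

# Export a standalone app and open it in the browser
catalog.export_app("./my-catalog-app", open_browser=True)
```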

## Catalog.finalize()

```python
catalog.finalize()
```

Advanced lifecycle method. Removes entities no longer seen during the scan.

You usually do not need to call it directly: `export_db()` and `export_app()` call it automatically after scanning.

## run_config()

```python
from datannurpy import run_config

catalog = run_config(path)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `path` | `str \| Path` | required | YAML configuration file to load and execute |

Runs a `catalog.yml` workflow and returns the resulting `Catalog`.
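For illustration, a minimal `catalog.yml` could contain just an `add` section (the same shape as the YAML metadata example shown later on this page); the path and name here are illustrative, and any further top-level keys are not covered by this reference:

```yaml
add:
  - folder: ./data
    name: Source data
```

`run_config("catalog.yml")` would then scan `./data` and return the populated `Catalog`.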

## copy_assets()

```python
from datannurpy import copy_assets

copy_assets(output_dir, rules, *, base_dir=None, quiet=False)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `output_dir` | `str \| Path` | required | Export directory to populate |
| `rules` | `dict \| list[dict]` | required | Copy rules using the same shape as YAML `copy_assets` |
| `base_dir` | `str \| Path \| None` | `None` | Base directory for relative `from` paths (defaults to the current working directory) |
| `quiet` | `bool` | `False` | Suppress copy progress logging |

Each rule accepts `from`, `to`, optional `include`, and optional `clean`.

`Catalog.export_db()` and `Catalog.export_app()` also accept `copy_assets=` and `base_dir=` as convenience wrappers around this helper.
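A sketch of direct use, assuming rule dicts carry the same `from`/`to`/`include` keys as the YAML form (paths and patterns are illustrative):

```python
from datannurpy import copy_assets

# Copy documentation PDFs from ./docs into an exported app
copy_assets(
    "./my-catalog-app",
    [{"from": "./docs", "to": "assets/docs", "include": ["*.pdf"]}],
)
```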

## EntityMetadata

```python
EntityMetadata(
    id=None,
    parent_id=None,
    manager_id=None,
    owner_id=None,
    tag_ids=None,
    doc_ids=None,
    name=None,
    description=None,
    license=None,
    type=None,
    link=None,
    localisation=None,
    start_date=None,
    end_date=None,
    updating_each=None,
    no_more_update=None,
)
```
| Parameter | Type | Description |
| --- | --- | --- |
| `id` | `str \| None` | Explicit entity ID. If omitted, scan-derived defaults are used. |
| `parent_id` | `str \| None` | Parent folder ID (`Folder.parent_id` for folders, `Dataset.folder_id` for datasets). |
| `manager_id` | `str \| None` | Managing organization ID. |
| `owner_id` | `str \| None` | Owning organization ID. |
| `tag_ids` | `list[str] \| None` | Related tag IDs. |
| `doc_ids` | `list[str] \| None` | Related document IDs. |
| `name` | `str \| None` | Display name. |
| `description` | `str \| None` | Description text. |
| `license` | `str \| None` | License string. |
| `type` | `str \| None` | Entity type/category. |
| `link` | `str \| None` | External reference URL. |
| `localisation` | `str \| None` | Geographic coverage. |
| `start_date` | `str \| None` | Covered period start. |
| `end_date` | `str \| None` | Covered period end. |
| `updating_each` | `str \| None` | Update frequency. |
| `no_more_update` | `str \| None` | Marker that no further updates are expected. |

In YAML configs, the same metadata is usually written as top-level keys on an add entry:

```yaml
add:
  - folder: ./data
    id: source
    name: Source data
    description: Curated files used by the analytics team.

  - dataset: ./data/sales.csv
    id: source---sales
    folder_id: source
    name: Sales
    description: Monthly sales by product and region.
```

EntityMetadata is the Python API equivalent of those YAML metadata keys.
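For example, the folder entry above could be expressed from Python roughly like this (a sketch; the ID, name, and description mirror the YAML):

```python
from datannurpy import Catalog, EntityMetadata

catalog = Catalog()
catalog.add_folder(
    "./data",
    metadata=EntityMetadata(
        id="source",
        name="Source data",
        description="Curated files used by the analytics team.",
    ),
)
```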

## ID helpers

```python
from datannurpy import sanitize_id, build_dataset_id, build_variable_id
```

| Function | Description | Example |
| --- | --- | --- |
| `sanitize_id(s)` | Clean a string for use as an ID | `"My File (v2)"` → `"My File v2"` |
| `build_dataset_id(folder_id, dataset_name)` | Build a dataset ID | `("src", "sales")` → `"src---sales"` |
| `build_variable_id(folder_id, dataset_name, var)` | Build a variable ID | `("src", "sales", "amount")` → `"src---sales---amount"` |