Scanning files

Use Scan depth to choose how much metadata datannurpy extracts. The same depth setting applies to add_folder, add_dataset, and add_database, either globally or per add entry.
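
For example, the depth could be set once at the top level and overridden for a single entry. This is a sketch only: the key name below is an assumption (this page never shows it), and the levels are the ones mentioned elsewhere on this page (dataset, variable, stat, value); see the Scan depth page for the exact syntax:

yaml
depth: stat                  # assumed key name, global default

add:
  - folder: ./data           # inherits stat
  - folder: ./archive
    depth: dataset           # assumed per-entry override

The block below shows the main ways to add folders and files: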

yaml
add:
  # Scan a folder (CSV, Excel, SAS)
  - folder: ./data

  # With custom folder metadata
  - folder: ./data
    id: prod
    name: Production

  # With filtering options
  - folder: ./data
    include: ["*.csv", "*.xlsx"]
    exclude: ["**/tmp/**"]
    recursive: true
    csv_encoding: utf-8        # or cp1252, iso-8859-1 (auto-detected by default)

  # Multiple folders with shared options
  - folder: [./data/sales, ./data/hr]
    include: ["*.csv"]

  # A single file
  - dataset: ./data/sales.csv

  # Multiple files
  - dataset:
      - ./data/sales.csv
      - ./data/products.csv

Filtering patterns

include and exclude patterns are matched against normalized relative paths from the scanned folder. Paths use / separators on every platform, and leading / or ./ in patterns is ignored. Filtering first keeps files that match at least one include pattern when include is set, then removes files that match any exclude pattern.

Examples:

Pattern            Meaning
name.csv           Exact file at the scanned folder root
subdir/name.csv    Exact relative file path
*.csv              Any CSV file at any depth
subdir/*.csv       CSV files directly inside subdir
**/tmp/**          Files under any tmp directory
tmp/               Everything under the root tmp directory
**/tmp/            Everything under any directory named tmp
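
For example, to keep only CSV files at any depth while skipping anything under a tmp directory:

yaml
add:
  - folder: ./data
    include: ["*.csv"]
    exclude: ["**/tmp/**"]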

Time series detection

When time_series: true (default), files with temporal patterns in their names or parent folders are automatically grouped into a single dataset:

data/
├── enquete_2020.csv    ─┐
├── enquete_2021.csv     ├─→ Single dataset "enquete" with nb_resources=3
├── enquete_2022.csv    ─┘
└── reference.csv       ─→ Separate dataset "reference"

The resulting dataset includes nb_resources, start_date, and end_date. Variables track their own start_date and end_date when their presence changes across periods.

Set time_series: false to treat each file as a separate dataset.
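
For example, to disable grouping for a single source (a sketch assuming time_series, like the other options on this page, can be set per add entry):

yaml
add:
  - folder: ./data
    time_series: false       # each file becomes its own dataset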

See Time series grouping for supported patterns, database table grouping, schema evolution, and false-positive rules.

Parquet formats

datannurpy supports both simple Parquet files and partitioned datasets (Delta, Hive, Iceberg):

yaml
add:
  # add_folder auto-detects all formats
  - folder: ./data             # scans *.parquet + Delta/Hive/Iceberg directories

  # Single partitioned directory with metadata override
  - dataset: ./data/sales_delta
    name: Sales Data
    description: Monthly sales
    folder:
      id: sales
      name: Sales

With the [delta] and [iceberg] extras installed (pip install datannurpy[delta] or datannurpy[iceberg]), table metadata (name, description, column docs) is extracted when available.

CSV options

Scanned CSV files normally go through a UTF-8 temporary copy. When your files are already local and UTF-8-encoded, skip that copy; the scanner automatically falls back to copying if encoding detection fails:

yaml
csv_skip_copy: true

Remote storage

Scan files on SFTP servers or cloud storage (S3, Azure, GCS). The storage_options dict is passed directly to fsspec — see provider docs for available options:

yaml
env_file: .env               # SFTP_PASSWORD, AWS_KEY, AWS_SECRET, etc.

SFTP

Requires pip install datannurpy[ssh].

yaml
add:
  - folder: sftp://user@host/path/to/data
    storage_options:
      password: ${SFTP_PASSWORD}   # or key_filename: /path/to/key

Amazon S3

Requires pip install datannurpy[s3].

yaml
add:
  - folder: s3://my-bucket/data
    storage_options:
      key: ${AWS_KEY}
      secret: ${AWS_SECRET}

Azure Blob

Requires pip install datannurpy[azure].

yaml
add:
  - folder: az://container/data
    storage_options:
      account_name: ${AZURE_ACCOUNT}
      account_key: ${AZURE_KEY}

Google Cloud Storage

Requires pip install datannurpy[gcs].

yaml
add:
  - folder: gs://my-bucket/data
    storage_options:
      token: /path/to/credentials.json

Single remote file

yaml
add:
  - dataset: s3://my-bucket/data/sales.parquet
    storage_options:
      key: ${AWS_KEY}
      secret: ${AWS_SECRET}

Sampling

By default, sample_size is 100000 and all entries inherit this value. Override it per entry, or set it to null to disable sampling:

yaml
sample_size: 100000               # default

add:
  - folder: ./data                # inherits 100000

  - folder: ./small
    sample_size: null             # no sampling

  - database: postgresql://localhost/mydb
    sample_size: 50000            # override

To disable sampling globally:

yaml
sample_size: null

When a dataset has more rows than sample_size, a uniform random sample is used for frequency counts and enumeration detection. All other statistics (nb_row, nb_missing, nb_distinct, min, max, mean, std) are computed on the full dataset.

The actual number of sampled rows is recorded in Dataset.sample_size (null when no sampling was applied).

Dataset previews

By default, preview_rows is 100. At stat and value depth, each scanned dataset exports up to that many rows in preview/<dataset_id>.json and preview/<dataset_id>.json.js. These rows come from data already read during scanning when possible, including reservoir samples used for frequency detection.

Override the limit per file source, or set false to disable previews for one source while keeping the global default:

yaml
preview_rows: 100

add:
  - folder: ./data/public
    preview_rows: 50

  - dataset: ./data/private.csv
    preview_rows: false

Previews are scan-time data. They are not generated at dataset or variable depth, and export commands do not have a separate preview_rows override.