trailpack.validation.standard_validator

Standard validator for Trailpack data packages.

Validates metadata, resources, fields, and data quality against the Trailpack standard specification.

Classes

StandardValidator

Validate data packages against Trailpack standards.

ValidationResult

Result of a validation check.

Module Contents

class trailpack.validation.standard_validator.StandardValidator(version: str = '1.0.0')

Validate data packages against Trailpack standards.

The StandardValidator checks data packages for:
  • Metadata completeness (required and recommended fields)
  • Resource definitions (proper schema, formats, name sanitization)
  • Field definitions (types, units, constraints)
  • Data quality (missing values, duplicates, type consistency)
  • Schema matching (column types match field definitions)

All numeric fields must have units specified, even for dimensionless quantities:
  • Measurements: Use appropriate SI or domain units (kg, m, °C, etc.)
  • Counts/IDs: Use the dimensionless unit (http://qudt.org/vocab/unit/NUM)
  • Percentages: Use percent or the dimensionless unit

Resource Name Sanitization: Resource names must match ^[a-z0-9-_.]+$. The validator automatically:
  • Detects invalid resource names
  • Suggests sanitized alternatives
  • Can auto-sanitize names with sanitize_resource_name()

Automatic Inconsistency Export: When type inconsistencies are detected during validation (e.g., mixed types in a column), each inconsistent value is tracked and automatically exported to ‘data_inconsistencies.csv’ when the ValidationResult is printed. This provides a detailed breakdown for data cleaning workflows.

Example

>>> validator = StandardValidator("1.0.0")
>>> result = validator.validate_metadata(metadata)
>>> if result.is_valid:
...     print("✅ Valid!")
... else:
...     print(result)
>>> # Validate with schema (auto-exports data_inconsistencies.csv if inconsistencies are found)
>>> result = validator.validate_data_quality(df, schema=schema)
>>> print(result)  # Shows errors and exports CSV automatically
>>> # Sanitize resource names
>>> clean_name = validator.sanitize_resource_name("My File!")
>>> print(clean_name)  # "my_file"

Initialize validator with a specific standard version.

Parameters:

version – Standard version to validate against (default: “1.0.0”)

_determine_level(result: ValidationResult) -> str

Determine validation level based on errors and warnings.

Parameters:

result – ValidationResult to evaluate

Returns:

Validation level badge string

_load_standard(version: str) -> Dict[str, Any]

Load the standard specification from YAML.

_validate_data_against_schema(df: pandas.DataFrame, schema: Dict[str, Any], quality_spec: Dict[str, Any]) -> ValidationResult

Validate DataFrame against field schema definitions.

Checks that actual column types match declared field types, and that numeric fields have proper units defined.

Parameters:
  • df – DataFrame to validate

  • schema – Schema dictionary with field definitions

  • quality_spec – Quality specification from standard

Returns:

ValidationResult with schema validation errors
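The two checks described above (declared type versus actual values, and units on numeric fields) can be illustrated with a minimal sketch. This is not the library's implementation: it uses plain Python lists in place of a DataFrame so that it runs without pandas, and the helper name `check_columns_against_schema`, the `PY_TYPES` mapping, and the message wording are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical mapping from schema type names to Python types (an assumption).
PY_TYPES = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
}
NUMERIC_TYPES = {"integer", "number"}


def check_columns_against_schema(
    columns: Dict[str, List[Any]], schema: Dict[str, Any]
) -> List[str]:
    """Return error messages for type mismatches and numeric fields without units."""
    errors: List[str] = []
    for field in schema.get("fields", []):
        name, declared = field["name"], field.get("type")
        # Numeric fields need a unit, even a dimensionless one.
        if declared in NUMERIC_TYPES and not field.get("unit"):
            errors.append(f"{name}: numeric field has no unit")
        expected = PY_TYPES.get(declared)
        if expected is None or name not in columns:
            continue
        for value in columns[name]:
            if value is None:
                continue  # nulls are treated as a quality metric, not a type error
            # bool is a subclass of int, so reject it explicitly for non-boolean fields
            bool_mismatch = isinstance(value, bool) and declared != "boolean"
            if bool_mismatch or not isinstance(value, expected):
                errors.append(
                    f"{name}: expected {declared}, found {type(value).__name__}"
                )
                break  # one message per column is enough for this sketch
    return errors


schema = {
    "fields": [
        {
            "name": "id",
            "type": "integer",
            "unit": {"name": "dimensionless", "path": "http://qudt.org/vocab/unit/NUM"},
        },
        {"name": "mass", "type": "number"},  # unit deliberately missing
    ]
}
columns = {"id": [1, 2, 3], "mass": [1.5, "heavy", 2.0]}
errors = check_columns_against_schema(columns, schema)
```

Here `errors` holds one message for the missing unit on `mass` and one for the string value found in that column.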

_validate_field_value(field_name: str, value: Any, field_def: Dict[str, Any]) -> ValidationResult

Validate a specific field value against its definition.

Parameters:
  • field_name – Name of the field

  • value – Value to validate

  • field_def – Field definition from standard

Returns:

ValidationResult with validation errors

get_help_url(topic: str) -> str | None

Get help URL for a specific topic.

Parameters:

topic – Topic name (e.g., ‘frictionless_spec’, ‘qudt_units’)

Returns:

URL string or None if not found

sanitize_resource_name(name: str) -> str

Sanitize resource name to match the required pattern ^[a-z0-9-_.]+$.

The resource name must contain only:
  • Lowercase letters (a-z)
  • Numbers (0-9)
  • Hyphens (-)
  • Underscores (_)
  • Dots (.)

Parameters:

name – Raw name string to sanitize

Returns:

Sanitized name matching the required pattern

Example

>>> validator = StandardValidator()
>>> validator.sanitize_resource_name("My Resource Name!")
'my_resource_name'
>>> validator.sanitize_resource_name("Test@123")
'test123'
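One way to obtain the outputs shown in the example is sketched below. The rules used here (whitespace becomes an underscore, every other disallowed character is dropped) are assumptions inferred from those two examples, not the library's documented algorithm, and `sanitize_name` is a hypothetical stand-in for `sanitize_resource_name()`. The hyphen in the character class is escaped only to make the class unambiguous.

```python
import re

# Same character set as the documented pattern ^[a-z0-9-_.]+$
RESOURCE_NAME_PATTERN = re.compile(r"^[a-z0-9\-_.]+$")


def sanitize_name(name: str) -> str:
    """Hypothetical stand-in for sanitize_resource_name()."""
    lowered = name.strip().lower()
    # Assumption: runs of whitespace become a single underscore.
    underscored = re.sub(r"\s+", "_", lowered)
    # Assumption: any remaining disallowed character is removed.
    return re.sub(r"[^a-z0-9\-_.]", "", underscored)


first = sanitize_name("My Resource Name!")
second = sanitize_name("Test@123")
```

A name made up entirely of disallowed characters would come out empty under these rules, which no longer matches the pattern; how the real method handles that case is not stated here.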

validate_all(metadata: Dict[str, Any], df: pandas.DataFrame | None = None, mappings: Dict[str, Any] | None = None) -> ValidationResult

Validate everything: metadata, data quality, and mappings.

Parameters:
  • metadata – Data package metadata dictionary

  • df – Optional DataFrame to validate data quality

  • mappings – Optional field mappings to validate

Returns:

ValidationResult with all validation results

validate_and_sanitize_resource_name(name: str, auto_fix: bool = False) -> Tuple[bool, str, str | None]

Validate a resource name and optionally sanitize it.

Parameters:
  • name – Resource name to validate

  • auto_fix – If True, return sanitized name; if False, just validate

Returns:

Tuple of (is_valid, original_or_sanitized_name, suggestion):
  • is_valid: Whether the original name is valid
  • original_or_sanitized_name: The original name if it is valid or auto_fix is False; the sanitized name if it is invalid and auto_fix is True
  • suggestion: Sanitized name suggestion if the original is invalid, None otherwise

Example

>>> validator = StandardValidator()
>>> is_valid, name, suggestion = validator.validate_and_sanitize_resource_name("Invalid Name!")
>>> print(f"Valid: {is_valid}, Suggestion: {suggestion}")
Valid: False, Suggestion: invalid_name
>>> is_valid, name, _ = validator.validate_and_sanitize_resource_name("valid-name")
>>> print(f"Valid: {is_valid}, Name: {name}")
Valid: True, Name: valid-name
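The three-part return value can be easier to follow as code. The sketch below reproduces the documented contract only; `check_name` and `simple_sanitize` are hypothetical names, and the sanitizing rules are assumptions inferred from the examples on this page.

```python
import re
from typing import Callable, Optional, Tuple

NAME_RE = re.compile(r"^[a-z0-9\-_.]+$")


def simple_sanitize(name: str) -> str:
    """Assumed rules: lowercase, whitespace to underscore, drop the rest."""
    return re.sub(r"[^a-z0-9\-_.]", "", re.sub(r"\s+", "_", name.strip().lower()))


def check_name(
    name: str,
    sanitize: Callable[[str], str] = simple_sanitize,
    auto_fix: bool = False,
) -> Tuple[bool, str, Optional[str]]:
    """Return (is_valid, original_or_sanitized_name, suggestion)."""
    if NAME_RE.match(name):
        # A valid name is returned unchanged and needs no suggestion.
        return True, name, None
    suggestion = sanitize(name)
    # With auto_fix the second element carries the sanitized name.
    return False, (suggestion if auto_fix else name), suggestion


report_only = check_name("Invalid Name!")
fixed = check_name("Invalid Name!", auto_fix=True)
already_valid = check_name("valid-name")
```

Note that `is_valid` always describes the original input, so it stays `False` for an invalid name even when `auto_fix` returns a usable replacement.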

validate_data_quality(df: pandas.DataFrame, schema: Dict[str, Any] | None = None) -> ValidationResult

Validate data quality of a DataFrame.

Data quality checks are logged as informational messages, not errors:
  • Missing data: Percentage of nulls per column
  • Duplicates: Percentage of duplicate rows

Type consistency checks are recorded as errors, not just logged:
  • Mixed types: Columns with multiple Python types (e.g., strings and integers mixed)
  • Schema matching: Column types must match field definitions
  • Unit requirements: Numeric fields must have units (including dimensionless for IDs/counts)

Automatic Inconsistency Export: When type inconsistencies are detected (mixed types in columns), each inconsistent value is tracked with its row number, column, actual type, and expected type. These inconsistencies are automatically exported to ‘data_inconsistencies.csv’ when the ValidationResult is printed. You can also manually export to a custom location using result.export_inconsistencies_to_csv("custom_path.csv").

Parameters:
  • df – DataFrame to validate

  • schema – Optional schema with field definitions to validate against. Should contain a ‘fields’ list with field definitions including:
      - name: Field name matching column name
      - type: Field type (string, integer, number, boolean, etc.)
      - unit: Unit definition (required for numeric fields)
      - description: Field description

Returns:

ValidationResult with:
  • errors: Type consistency violations (mixed types, schema mismatches)

  • info: Data quality metrics (nulls, duplicates)

  • inconsistencies: List of dicts with details about each inconsistent value (automatically exported to CSV when result is printed)

Return type:

ValidationResult

Example

>>> schema = {
...     "fields": [
...         {
...             "name": "id",
...             "type": "integer",
...             "description": "Unique identifier",
...             "unit": {"name": "dimensionless", "path": "http://qudt.org/vocab/unit/NUM"}
...         },
...         {
...             "name": "mass",
...             "type": "number",
...             "description": "Mass measurement",
...             "unit": {"name": "kg", "path": "http://qudt.org/vocab/unit/KiloGM"}
...         }
...     ]
... }
>>> result = validator.validate_data_quality(df, schema=schema)
>>> # result.errors will contain type/schema mismatches and mixed type violations
>>> # result.info will contain data quality observations (nulls, duplicates)

Note

Identifier fields (with “id”, “index”, “identifier” in name or description) are automatically recognized and should use dimensionless units.
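The informational metrics mentioned for this method (nulls per column and duplicate rows) can be approximated with the sketch below. It works on a list of row dictionaries rather than a DataFrame so that it runs without pandas. The function names are hypothetical, and counting a duplicate as a row that repeats an earlier row is an assumption about how the percentage is defined.

```python
from typing import Any, Dict, List


def null_percentages(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Percentage of None values per column."""
    if not rows:
        return {}
    return {
        col: 100.0 * sum(row.get(col) is None for row in rows) / len(rows)
        for col in rows[0]
    }


def duplicate_percentage(rows: List[Dict[str, Any]]) -> float:
    """Percentage of rows that repeat an earlier row (values must be hashable)."""
    if not rows:
        return 0.0
    columns = list(rows[0])
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 100.0 * duplicates / len(rows)


rows = [
    {"id": 1, "mass": 1.5},
    {"id": 2, "mass": None},
    {"id": 1, "mass": 1.5},  # repeats the first row
    {"id": 3, "mass": 2.0},
]
nulls = null_percentages(rows)
dupes = duplicate_percentage(rows)
```

Both values are observations rather than failures, which is why the documentation files them under `info` instead of `errors`.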


validate_field_definition(field: Dict[str, Any]) -> ValidationResult

Validate a field (column) definition.

Parameters:

field – Field dictionary from schema

Returns:

ValidationResult with validation errors and warnings

validate_metadata(metadata: Dict[str, Any]) -> ValidationResult

Validate metadata against required and recommended fields.

Parameters:

metadata – Data package metadata dictionary

Returns:

ValidationResult with validation errors and warnings

validate_resource(resource: Dict[str, Any]) -> ValidationResult

Validate a resource (data file) definition.

Automatically checks and suggests sanitized names for invalid resource names.

Parameters:

resource – Resource dictionary from metadata

Returns:

ValidationResult with validation errors and warnings

standard

version = '1.0.0'

class trailpack.validation.standard_validator.ValidationResult

Result of a validation check.

Contains three types of messages:
  • errors: Type consistency violations that fail validation
  • warnings: Recommended fields or practices that should be addressed
  • info: Data quality metrics and informational messages

A validation is considered valid (passed) if there are no errors, regardless of warnings or info messages.

Data Inconsistency Tracking: When type inconsistencies are detected (e.g., mixed types in a column), each inconsistent value is tracked in the inconsistencies list. This list is automatically exported to ‘data_inconsistencies.csv’ when the result is printed or converted to string. The CSV file contains the row, column, value, actual type, and expected type for each inconsistency.
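The pass/fail rule described above (valid whenever the error list is empty) can be shown with a small dataclass. This is a sketch of the documented semantics, not the actual class: `ResultSketch` is a hypothetical name and the message strings are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResultSketch:
    """Hypothetical stand-in that mirrors the documented message lists."""

    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    info: List[str] = field(default_factory=list)
    level: Optional[str] = None

    @property
    def is_valid(self) -> bool:
        # Only errors fail validation; warnings and info never do.
        return not self.errors

    @property
    def has_warnings(self) -> bool:
        return bool(self.warnings)


result = ResultSketch()
result.warnings.append("illustrative warning")
result.info.append("illustrative quality metric")
valid_with_warnings = result.is_valid  # still valid: no errors yet

result.errors.append("illustrative type mismatch")
valid_with_errors = result.is_valid  # now invalid
```

Using `default_factory` gives every instance its own lists, so messages from one validation run cannot leak into another.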

errors

List of error messages (type consistency violations)

warnings

List of warning messages (recommended practices)

info

List of informational messages (data quality metrics)

level

Validation compliance level (if assigned)

inconsistencies

List of dicts with inconsistent value details

add_error(message: str, field: str | None = None)

Add an error message.

add_inconsistency(row: int, column: str, value: Any, actual_type: str, expected_type: str)

Track a data type inconsistency for later export.

This method is called automatically during validation when mixed types are detected in a column. Each inconsistent value (one that doesn’t match the most common type in the column) is recorded with its location and type information.

The inconsistencies are automatically exported to CSV when the ValidationResult is printed or can be manually exported using export_inconsistencies_to_csv().

Parameters:
  • row – Row index of the inconsistent value

  • column – Column name where the inconsistency was found

  • value – The actual inconsistent value

  • actual_type – Python type name of the value (e.g., ‘int’, ‘str’)

  • expected_type – Expected type based on most common type in column
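The idea of comparing each value against the most common type in its column can be sketched as follows. The helper name `find_type_inconsistencies` is hypothetical, skipping `None` values is an assumption, and the source does not say how ties between equally common types are resolved (here the type seen first wins).

```python
from collections import Counter
from typing import Any, Dict, List


def find_type_inconsistencies(column: str, values: List[Any]) -> List[Dict[str, Any]]:
    """Record every value whose type differs from the column's most common type."""
    # Assumption: nulls are ignored when looking for mixed types.
    typed = [
        (row, value, type(value).__name__)
        for row, value in enumerate(values)
        if value is not None
    ]
    if not typed:
        return []
    expected_type, _count = Counter(name for _, _, name in typed).most_common(1)[0]
    return [
        {
            "row": row,
            "column": column,
            "value": value,
            "actual_type": actual,
            "expected_type": expected_type,
        }
        for row, value, actual in typed
        if actual != expected_type
    ]


issues = find_type_inconsistencies("count", [1, 2, "three", 4, 5.0])
```

Each dictionary carries the same five keys that `add_inconsistency()` takes as arguments, so the entries could be passed straight through with `result.add_inconsistency(**issue)`.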

add_info(message: str, field: str | None = None)

Add an info message.

add_warning(message: str, field: str | None = None)

Add a warning message.

export_inconsistencies_to_csv(output_path: str = 'data_inconsistencies.csv')

Export data type inconsistencies to a CSV file for analysis.

Creates a CSV file with details about each value that has an inconsistent type compared to the expected type in its column. This is useful for data cleaning workflows where you need to identify and fix specific problematic values.

The CSV includes columns: row, column, value, actual_type, expected_type

This method is called automatically when printing the ValidationResult if inconsistencies exist, but can also be called manually to export to a custom location.

Parameters:

output_path – Path to the output CSV file. Defaults to “data_inconsistencies.csv” in the current working directory.

Returns:

Path to the created CSV file (str), or None if no inconsistencies to export.

Example

>>> result = validator.validate_data_quality(df, schema)
>>> if result.inconsistencies:
...     csv_path = result.export_inconsistencies_to_csv("issues.csv")
...     print(f"Found {len(result.inconsistencies)} issues in {csv_path}")
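For readers who want to see what such an export involves, here is a minimal sketch built on the standard `csv` module. It follows the documented column list and the return contract (path on success, `None` when there is nothing to write), but `export_to_csv` is a hypothetical function, not the method itself.

```python
import csv
import os
import tempfile
from typing import Any, Dict, List, Optional

FIELDNAMES = ["row", "column", "value", "actual_type", "expected_type"]


def export_to_csv(
    inconsistencies: List[Dict[str, Any]],
    output_path: str = "data_inconsistencies.csv",
) -> Optional[str]:
    """Write one CSV row per inconsistency; return the path, or None if empty."""
    if not inconsistencies:
        return None
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(inconsistencies)
    return output_path


# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = export_to_csv(
        [
            {
                "row": 2,
                "column": "count",
                "value": "three",
                "actual_type": "str",
                "expected_type": "int",
            }
        ],
        os.path.join(tmp, "issues.csv"),
    )
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().strip()

nothing_written = export_to_csv([])
```

Because printing a result triggers the default export into the current working directory, passing an explicit path as above is the safer choice in scripts that run from varying locations.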

get_summary() -> str

Get a summary of the validation result.

errors: List[str] = []

property has_warnings: bool

Check if there are any warnings.

inconsistencies: List[Dict[str, Any]] = []

info: List[str] = []

property is_valid: bool

Check if validation passed (no errors).

level: str | None = None

warnings: List[str] = []