smart_reader

Classes

SmartDataReader

Adaptive data reader that chooses optimal technology based on file size.

Module Contents

class smart_reader.SmartDataReader(file_path: str | pathlib.Path)

Adaptive data reader that chooses optimal technology based on file size.

Strategy:
- <10MB: pandas (simplicity, compatibility)
- 10-500MB: polars (speed, memory efficiency)
- >500MB: polars lazy (streaming, minimal memory)
- CSV always: pyarrow or polars (much faster than pandas)

_choose_engine() → str

Choose optimal engine based on file size.
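
The size-based selection described in the class strategy could be sketched roughly as follows. This is not the module's actual implementation: the function name `choose_engine`, the engine labels other than `'polars'`, the optional `file_size` argument, and the choice to route every CSV to PyArrow are assumptions made for illustration. The two thresholds mirror the `SMALL_FILE` and `LARGE_FILE` attribute values listed on this page.

```python
from pathlib import Path

SMALL_FILE = 10 * 1024 * 1024    # 10485760 bytes, same value as SmartDataReader.SMALL_FILE
LARGE_FILE = 500 * 1024 * 1024   # 524288000 bytes, same value as SmartDataReader.LARGE_FILE


def choose_engine(file_path, file_size=None):
    """Return an engine label for a file (hypothetical helper, assumed labels)."""
    path = Path(file_path)
    # Use the size on disk unless the caller supplies one explicitly.
    size = path.stat().st_size if file_size is None else file_size

    if path.suffix.lower() == ".csv":
        # Assumption: CSV files always bypass pandas; this sketch picks PyArrow.
        return "pyarrow"
    if size < SMALL_FILE:
        return "pandas"
    if size <= LARGE_FILE:
        return "polars"
    return "polars_lazy"
```

Passing `file_size` explicitly makes the thresholds easy to exercise without creating large files on disk.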

_read_pandas(sheet_name: str | None = None) → pandas.DataFrame

Small files: Use pandas.

_read_pandas_chunked(sheet_name: str | None = None, chunk_size: int = 10000) → pandas.DataFrame

Read large Excel files in chunks and return the first chunk for preview.

_read_polars(sheet_name: str | None = None) → pandas.DataFrame

Medium files: Use polars, convert to pandas.

_read_polars_lazy(sheet_name: str | None = None) → pandas.DataFrame

Large files: Use lazy evaluation, process in chunks.

_read_pyarrow() → pandas.DataFrame

CSV with PyArrow (fastest CSV reader).

estimate_memory() → str

Estimate memory usage.

Returns:

Human-readable memory estimate string
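
The page does not say how the estimate is computed. One plausible shape, shown purely as a sketch, multiplies the on-disk size by an expansion factor and formats the result. The helper `format_bytes`, the standalone `estimate_memory` function, and the default factor of 3.0 are all assumptions, not values taken from the module.

```python
def format_bytes(num_bytes):
    """Format a byte count with 1024-based units, for example '10.0 MB'."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


def estimate_memory(file_size, expansion_factor=3.0):
    """Estimate in-memory size as file size times an assumed expansion factor."""
    return format_bytes(file_size * expansion_factor)
```

For instance, `estimate_memory(10485760, 2.0)` returns `'20.0 MB'` under these assumptions.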

read(sheet_name: str | None = None) → pandas.DataFrame

Read the file using the optimal engine and always return a pandas DataFrame.

Parameters:

sheet_name – Sheet name for Excel files (optional)

Returns:

pandas DataFrame with file contents

Why pandas output?
- The rest of the codebase expects pandas.
- Polars results can be converted to pandas at the end.
- Only the final result is held in memory.

LARGE_FILE = 524288000
SMALL_FILE = 10485760
engine = 'polars'
file_path
file_size