parquet_loader

ParquetLoader

ParquetLoader(base_path: str)

This class provides class methods to load parquet files from a specified directory structure.

Initialize the ParquetLoader with the base directory path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| base_path | str | The base directory where parquet files are stored. | required |

load_all_files classmethod

load_all_files(base_path: str) -> pd.DataFrame

Loads all parquet files in the specified base directory into a single pandas DataFrame.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| base_path | str | The base directory where parquet files are stored. | required |

Returns:

| Type | Description |
| --- | --- |
| pd.DataFrame | A DataFrame containing all the data from the parquet files. |
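A minimal sketch of what loading every file under a base directory could look like. It assumes files are found recursively by their `.parquet` suffix and concatenated in sorted path order. The names `find_parquet_files` and `load_all` are hypothetical and not part of `ParquetLoader`, whose actual implementation may differ.

```python
from pathlib import Path


def find_parquet_files(base_path: str) -> list[Path]:
    """Return every *.parquet file under base_path (hypothetical helper)."""
    # rglob walks all subdirectories; sorting gives a stable read order.
    return sorted(Path(base_path).rglob("*.parquet"))


def load_all(base_path: str):
    """Read each discovered file and concatenate them into one DataFrame."""
    # Imported here so that file discovery alone does not require pandas.
    import pandas as pd

    files = find_parquet_files(base_path)
    if not files:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    frames = [pd.read_parquet(path) for path in files]
    return pd.concat(frames, ignore_index=True)
```

Returning an empty DataFrame for an empty directory is a design choice of this sketch; the documented method does not state how that case is handled.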

load_by_time_range classmethod

load_by_time_range(
    base_path: str,
    start_time: Timestamp,
    end_time: Timestamp,
) -> pd.DataFrame

Loads parquet files that fall within a specified time range based on the directory structure.

The directory structure is expected to be in the format YYYY/MM/DD/HH.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| base_path | str | The base directory where parquet files are stored. | required |
| start_time | Timestamp | The start timestamp. | required |
| end_time | Timestamp | The end timestamp. | required |

Returns:

| Type | Description |
| --- | --- |
| pd.DataFrame | A DataFrame containing the data from the parquet files within the time range. |
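To illustrate how a time range can map onto the YYYY/MM/DD/HH layout, the sketch below enumerates one directory per hour between two timestamps. It assumes both ends are inclusive at hour granularity, which the documentation does not confirm. `hourly_dirs` is a hypothetical helper, and plain `datetime` objects stand in for `Timestamp` (a `pandas.Timestamp` is a `datetime` subclass, so it can be passed as well).

```python
from datetime import datetime, timedelta
from pathlib import Path


def hourly_dirs(base_path: str, start_time: datetime, end_time: datetime) -> list[Path]:
    """List the YYYY/MM/DD/HH directories covering start_time..end_time.

    Assumption: both ends are inclusive at hour granularity.
    """
    # Truncate to the start of the hour so a partial first hour is included.
    current = start_time.replace(minute=0, second=0, microsecond=0)
    dirs = []
    while current <= end_time:
        dirs.append(Path(base_path) / current.strftime("%Y/%m/%d/%H"))
        current += timedelta(hours=1)
    return dirs


if __name__ == "__main__":
    for d in hourly_dirs("data", datetime(2024, 1, 1, 22, 30), datetime(2024, 1, 2, 0, 15)):
        print(d.as_posix())
```

With these inputs the loop visits hours 22 and 23 on 1 January and hour 00 on 2 January, so the range crosses a day boundary without special handling.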

load_by_uuid_list classmethod

load_by_uuid_list(
    base_path: str, uuid_list: list
) -> pd.DataFrame

Loads parquet files that match any UUID in the specified list.

The UUIDs are expected to be part of the file names.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| base_path | str | The base directory where parquet files are stored. | required |
| uuid_list | list | A list of UUIDs to filter the files. | required |

Returns:

| Type | Description |
| --- | --- |
| pd.DataFrame | A DataFrame containing the data from the parquet files with matching UUIDs. |
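The file-name matching described above might be sketched as follows. It assumes a plain substring match of each UUID against the file name and a recursive search under the base directory. `files_matching_uuids` is a hypothetical helper, not the library's own function.

```python
from pathlib import Path


def files_matching_uuids(base_path: str, uuid_list: list[str]) -> list[Path]:
    """Return parquet files whose file name contains any UUID from uuid_list."""
    wanted = set(uuid_list)  # drop duplicate UUIDs before matching
    return sorted(
        path
        for path in Path(base_path).rglob("*.parquet")
        # Assumption: the UUID appears verbatim somewhere in the file name.
        if any(uuid in path.name for uuid in wanted)
    )
```

An empty `uuid_list` selects nothing in this sketch, since `any()` over an empty set is false.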

load_files_by_time_range_and_uuids classmethod

load_files_by_time_range_and_uuids(
    base_path: str,
    start_time: Timestamp,
    end_time: Timestamp,
    uuid_list: list,
) -> pd.DataFrame

Loads parquet files that fall within a specified time range and match any UUID in the list.

The directory structure is expected to be in the format YYYY/MM/DD/HH, and UUIDs are part of the file names.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| base_path | str | The base directory where parquet files are stored. | required |
| start_time | Timestamp | The start timestamp. | required |
| end_time | Timestamp | The end timestamp. | required |
| uuid_list | list | A list of UUIDs to filter the files. | required |

Returns:

| Type | Description |
| --- | --- |
| pd.DataFrame | A DataFrame containing the data from the parquet files that meet both criteria. |
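Combining both criteria could look like the sketch below, which parses the hour from each file's YYYY/MM/DD/HH parent directories and then checks the file name for a listed UUID. It assumes timezone-naive timestamps, inclusive bounds at hour granularity, and substring matching on the file name. `select_files` is a hypothetical name; the library's actual selection logic may differ.

```python
from datetime import datetime
from pathlib import Path


def select_files(
    base_path: str,
    start_time: datetime,
    end_time: datetime,
    uuid_list: list[str],
) -> list[Path]:
    """Pick parquet files whose directory hour is in range and whose name has a listed UUID."""
    base = Path(base_path)
    # Truncate the start to the hour so a partial first hour still counts.
    first_hour = start_time.replace(minute=0, second=0, microsecond=0)
    selected = []
    for path in sorted(base.rglob("*.parquet")):
        parts = path.relative_to(base).parts
        if len(parts) != 5:
            continue  # not laid out as YYYY/MM/DD/HH/<file>
        try:
            hour = datetime.strptime("/".join(parts[:4]), "%Y/%m/%d/%H")
        except ValueError:
            continue  # directory names do not form a valid date and hour
        in_range = first_hour <= hour <= end_time
        if in_range and any(uuid in path.name for uuid in uuid_list):
            selected.append(path)
    return selected
```

The returned paths could then be read with `pandas.read_parquet` and concatenated with `pandas.concat` to produce a single DataFrame.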