Module markov.api.utils.uploader
Utilities for manually uploading pandas DataFrames and dataset files to remote storage (for example an S3 store) via compressed, multipart uploads.
Functions
def manual_upload_dataframe(df: pandas.core.frame.DataFrame, segment_type: str, dataset_name: str)
-
Manually uploads a DataFrame to a specified location and returns the upload path and credential ID.
This function handles uploading a pandas DataFrame to a storage location: it compresses the DataFrame, splits it into manageable parts, initiates a multipart upload, uploads each part, and completes the upload. If the DataFrame size exceeds the maximum allowed size, an IOError is raised. A spinner indicates progress during the upload.
Args
df
:pd.DataFrame
- The DataFrame to be uploaded.
segment_type
:str
- A string representing the type of the data segment being uploaded.
dataset_name
:str
- The name of the dataset to which this DataFrame belongs.
Returns
tuple
- A tuple containing two elements:
  - uploaded_path (str): The path where the DataFrame has been uploaded.
  - cred_id (str): The credential ID used for the upload.
Raises
IOError
- If the DataFrame size exceeds the maximum allowed size.
UploadException
- If there is a failure in uploading the DataFrame.
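A minimal usage sketch (assuming the function is imported from this module; the segment type and dataset name are illustrative):

    import pandas as pd

    # Assumed import path, based on this module's name.
    from markov.api.utils.uploader import manual_upload_dataframe

    df = pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 0]})

    try:
        # Compresses, splits, and multipart-uploads the DataFrame.
        uploaded_path, cred_id = manual_upload_dataframe(
            df=df,
            segment_type="train",          # illustrative segment type
            dataset_name="demo-dataset",   # illustrative dataset name
        )
        print(f"Uploaded to {uploaded_path} (credential: {cred_id})")
    except IOError:
        # Raised when the DataFrame exceeds the maximum allowed upload size.
        print("DataFrame is too large to upload.")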
def manual_upload_dataframes(ds_paths: List[DataSegmentPath], dataset_name: str)
-
Uploads data segments to an S3 store and collects their new S3 URLs along with credential IDs.
This function iterates over a list of data segments, uploads each segment to the default S3 store, and collects the new S3 URL and credential ID for each upload. It assumes that all uploads share the same credential ID and therefore returns the single unique credential ID extracted from the collected IDs.
Args
ds_paths
:List[DataSegmentPath]
- A list of data segment objects, where each object contains information about the data source and segment type.
dataset_name
:str
- The name of the dataset to which these data segments belong.
Returns
tuple
- A tuple containing two elements:
  - A list of DataSegmentPath objects, each containing the S3 URL and segment type of the uploaded data segments.
  - A unique credential ID (str) used for the S3 uploads.
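A hedged usage sketch; the DataSegmentPath import path and constructor arguments shown here are assumptions (the docs only state that each object carries a data source and a segment type), so adjust them to the real API:

    # Assumed import paths, based on this module's name; DataSegmentPath may
    # actually live elsewhere in the markov package.
    from markov.api.utils.uploader import DataSegmentPath, manual_upload_dataframes

    # Hypothetical constructor arguments: a data source location and a segment type.
    ds_paths = [
        DataSegmentPath("s3://bucket/raw/train.csv", "train"),
        DataSegmentPath("s3://bucket/raw/test.csv", "test"),
    ]

    # Each segment is uploaded to the default S3 store; all uploads are
    # expected to share a single credential ID.
    uploaded_segments, cred_id = manual_upload_dataframes(
        ds_paths=ds_paths,
        dataset_name="demo-dataset",
    )

    for segment in uploaded_segments:
        print(segment)  # DataSegmentPath carrying the new S3 URL and segment type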
def manual_upload_dataset_path(file_path: str, segment_type: str)
-
Manually uploads a dataset file to a server in parts after compressing and splitting it.
This function handles the process of uploading a large dataset file. It first compresses the file, validates its size, splits it into manageable parts, and then uploads these parts to a server. The process involves multipart upload with presigned URLs.
Args
file_path
:str
- The path to the dataset file to be uploaded.
segment_type
:str
- The type of the dataset segment (e.g., 'train', 'test', 'validate').
Returns
tuple
- A tuple containing the path where the file was uploaded and the credential ID, in the format (uploaded_path, cred_id).
Raises
IOError
- If the compressed file size exceeds the maximum allowed size.
UploadException
- If there is a failure during the file upload process.
Note
This function cleans up the compressed file created during the process by deleting it after the upload is completed or in case of an exception.
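A minimal usage sketch (the file path and segment type are illustrative):

    # Assumed import path, based on this module's name.
    from markov.api.utils.uploader import manual_upload_dataset_path

    # The file is compressed, size-checked, split into parts, and uploaded via
    # presigned multipart URLs; the temporary compressed file is removed afterwards.
    uploaded_path, cred_id = manual_upload_dataset_path(
        file_path="data/train.csv",
        segment_type="train",
    )
    print(f"Uploaded to {uploaded_path} (credential: {cred_id})")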
def manual_upload_dataset_paths(ds_paths: List[DataSegmentPath])
def upload_large_dataframe_as_parts(df_buffer: _io.BytesIO, upload_urls: List[str], part_size: int) ‑> List
-
Uploads a large dataframe in parts to multiple upload URLs concurrently.
Args
df_buffer
:io.BytesIO
- Buffer containing the dataframe data.
upload_urls
:List[str]
- List of upload URLs, one per part.
part_size
:int
- Size of each part read from the buffer.
Returns
List
- List of responses from the uploads.
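The internal concurrency model is not documented here; the following is a minimal sketch of the general technique (slicing a BytesIO buffer into part_size chunks and PUT-ing each chunk to its presigned URL in parallel), not the library's implementation:

    import io
    from concurrent.futures import ThreadPoolExecutor
    from typing import List

    import requests


    def upload_parts_sketch(
        df_buffer: io.BytesIO, upload_urls: List[str], part_size: int
    ) -> List[requests.Response]:
        """Illustrative concurrent multipart upload to presigned URLs."""
        data = df_buffer.getvalue()

        # One chunk of part_size bytes per presigned URL.
        parts = [
            data[i * part_size:(i + 1) * part_size] for i in range(len(upload_urls))
        ]

        def put_part(url_and_chunk):
            url, chunk = url_and_chunk
            # Presigned URLs typically accept a plain HTTP PUT of the raw bytes.
            return requests.put(url, data=chunk)

        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(put_part, zip(upload_urls, parts)))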