Module markov.api.utils.common_utils
Functions
def create_df_config(location, name)
-
Create a configuration file for a new data_set family.
Args
location
- Valid location where this file is to be created on user system.
name
- User specified name of this configuration file.
Returns
Full path of the generated configuration file, created at the specified location if successful.
def create_ds_config(location, name)
-
Create the dataset configuration file at the user provided location on their system with given name.
Args
location
- Valid location on the user system.
name
- Name of the file.
Returns
Full path of the generated configuration file, created at the specified location if possible.
def create_upload_config(location, name)
-
Create config file to upload dataset to specified location.
Args
location
- Valid location on the user system where file has to be generated.
name
- User specified custom name of the configuration file.
Returns
Valid full path name of the configuration file.
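The three create_*_config helpers above share one contract: take a location and a file name, generate the config file there, and return its full path. A minimal sketch of that contract, assuming a JSON payload; the helper name and payload shape here are illustrative, not the library's actual implementation:

```python
import json
import os


def create_config(location: str, name: str, payload: dict) -> str:
    """Illustrative helper: write `payload` as JSON to `location/name`
    and return the full path of the generated file."""
    os.makedirs(location, exist_ok=True)
    full_path = os.path.join(location, name)
    with open(full_path, "w") as fh:
        json.dump(payload, fh, indent=2)
    return full_path
```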
def gen_text_hash(series: pd.Series)
def gen_unique_hash(*args)
-
Generate a unique hash for the given inputs.
Args
*args
- Input values to hash.
Returns
A unique hash of the inputs.
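A common way to implement such a helper is to serialize every argument and feed it into a digest. This sketch uses SHA-256 with a byte separator between arguments; the algorithm is an assumption, not the library's actual implementation:

```python
import hashlib


def gen_unique_hash(*args) -> str:
    """Return a deterministic hex digest, unique per input tuple."""
    h = hashlib.sha256()
    for arg in args:
        h.update(repr(arg).encode("utf-8"))
        # Separator byte so ("ab",) and ("a", "b") hash differently.
        h.update(b"\x1f")
    return h.hexdigest()
```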
def generate_composite_key(*args)
def get_mkv_config() ‑> MKVClientCredentials
-
Get the user created mkv_config containing information about user's base url and access token.
Returns
MKVClientCredentials config containing relevant information.
def infer_categorical_cols(df: DataFrame, min_count: int = 50, high_threshold: float = 0.9, low_threshold: float = 0.7) ‑> List
-
Infer categorical columns in the dataset based on the heuristic: (num_count - num_unique_count) / num_count >= threshold
Args
df
- DataFrame of the data_set segment to be analyzed for categorical columns.
min_count
- Minimum number of rows required to apply the higher threshold.
high_threshold
- Higher threshold, applied when the number of rows in the DataFrame is > min_count.
low_threshold
- Lower threshold, applied when the number of rows is < min_count.
Returns
List of categorical columns.
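The documented heuristic can be sketched directly in pandas. This standalone version is an approximation of the described behavior, not the library's exact code:

```python
from typing import List

import pandas as pd


def infer_categorical_cols(df: pd.DataFrame, min_count: int = 50,
                           high_threshold: float = 0.9,
                           low_threshold: float = 0.7) -> List[str]:
    """Flag columns whose value-repetition ratio meets the threshold."""
    threshold = high_threshold if len(df) > min_count else low_threshold
    categorical = []
    for col in df.columns:
        num_count = df[col].count()       # non-null values
        num_unique = df[col].nunique()
        if num_count and (num_count - num_unique) / num_count >= threshold:
            categorical.append(col)
    return categorical
```

With 30 rows the lower threshold (0.7) applies; a column with only 3 distinct values has ratio (30 - 3) / 30 = 0.9 and is flagged, while a unique-id column has ratio 0 and is not.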
def infer_data_type(df: DataFrame) ‑> Dict
-
Given a DataFrame infer type of data_set.
Args
df
- Input DataFrame of the data_set segment, processed to identify the right dtype for each column.
Returns
Dictionary with column names as keys and their inferred type (VisionBaseType) as values.
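A simplified sketch of dtype inference over a DataFrame; plain string labels stand in for the library's VisionBaseType values, which are not reproduced here:

```python
from typing import Dict

import pandas as pd


def infer_data_type(df: pd.DataFrame) -> Dict[str, str]:
    """Map each column name to a coarse inferred type label."""
    result = {}
    for col in df.columns:
        dtype = df[col].dtype
        if pd.api.types.is_numeric_dtype(dtype):
            result[col] = "numerical"
        elif pd.api.types.is_datetime64_any_dtype(dtype):
            result[col] = "datetime"
        else:
            result[col] = "text"
    return result
```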
def validate_upload_config(upload_config)
-
Validate the upload config against the server and avoid duplicates/overwrites.
Args
upload_config
- Input upload config.
Classes
class MarkovDatasetUtils
-
Static methods
def normalize(dataframe: DataFrame, col_names: List[str], filters: List[str], replace_symbol: str = ' ')
-
Normalization pipeline which applies the text filters one by one
>>> import markov
>>> markov.dataset_utils.normalize(dataframe=dataframe, col_names=["x", "y"], filters=["url", "email"])
<dataframe>
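A minimal standalone sketch of such a pipeline, applying regex-based text filters one by one to the selected columns; the two filter patterns are illustrative stand-ins, not the library's actual definitions:

```python
import re
from typing import List

import pandas as pd

# Illustrative filter patterns, keyed by filter name.
FILTERS = {
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"\S+@\S+\.\S+"),
}


def normalize(dataframe: pd.DataFrame, col_names: List[str],
              filters: List[str], replace_symbol: str = " ") -> pd.DataFrame:
    """Replace every match of each named filter in the given columns."""
    out = dataframe.copy()
    for col in col_names:
        for name in filters:
            out[col] = out[col].str.replace(FILTERS[name], replace_symbol,
                                            regex=True)
    return out
```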
class PresignedUrlParser (url)
-
>>> s = PresignedUrlParser("s3://bucket/hello/world.csv?qwe1=3#ddd")
>>> s.bucket
'bucket'
>>> s.filename
'world.csv'
>>> s.key
'hello/world.csv?qwe1=3#ddd'
>>> s.url
's3://bucket/hello/world.csv?qwe1=3#ddd'
Instance variables
prop bucket
prop filename
prop key
prop url
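A minimal sketch of such a parser. In the example above, key keeps the query string and fragment while filename does not, so this version splits the raw URL by hand rather than via urllib.parse. The property set matches the documented one, but the parsing details are assumptions:

```python
class PresignedUrlParser:
    """Parse an s3:// URL into bucket, key, and filename parts."""

    def __init__(self, url: str):
        self._url = url
        # Drop the "s3://" scheme, then split bucket from the rest.
        rest = url.split("://", 1)[1]
        self._bucket, self._key = rest.split("/", 1)

    @property
    def url(self) -> str:
        return self._url

    @property
    def bucket(self) -> str:
        return self._bucket

    @property
    def key(self) -> str:
        # Keeps any query string and fragment, as in the documented example.
        return self._key

    @property
    def filename(self) -> str:
        # Strip query string and fragment, then take the last path segment.
        return self._key.split("?", 1)[0].split("#", 1)[0].rsplit("/", 1)[-1]
```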
class TextFilter (value, names=None, *, module=None, qualname=None, type=None, start=1)
-
An enumeration.
Ancestors
- builtins.str
- enum.Enum
Class variables
var BRACKET : Final
var CURRENCY : Final
var EMAIL : Final
var EMOJI : Final
var HTML : Final
var MENTION_AT : Final
var MENTION_HASH : Final
var MULTI_WHITESPACE : Final
var NEW_LINE : Final
var NUMBERS : Final
var PHONE : Final
var SPECIAL_CHARS : Final
var URL : Final
Static methods
def has_member_key(key)
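The str-backed enum pattern with a membership check can be sketched as follows; the two members shown are illustrative, and the real TextFilter defines many more:

```python
from enum import Enum


class TextFilter(str, Enum):
    """String-valued enum: members compare equal to plain strings."""
    URL = "url"
    EMAIL = "email"

    @staticmethod
    def has_member_key(key) -> bool:
        # True when `key` matches one of the member values.
        return key in {item.value for item in TextFilter}
```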