Module markov.api.utils.common_utils

Functions

def create_df_config(location, name)

Create a configuration file for registering a new data_set family.

Args

location
Valid location where this file is to be created on user system.
name
User specified name of this configuration file.

Returns

Full path of the generated configuration file, if the file was created at the specified location.

def create_ds_config(location, name)

Create the dataset configuration file with the given name at the user-provided location on their system.

Args

location
Valid location on the user system.
name
Name of the file.

Returns

Full path of the generated configuration file, if the file could be created at the specified location.

def create_upload_config(location, name)

Create a configuration file for uploading a dataset, generated at the specified location.

Args

location
Valid location on the user's system where the file is to be generated.
name
User specified custom name of the configuration file.

Returns

Full valid path of the generated configuration file.

def gen_text_hash(series: pd.Series)
def gen_unique_hash(*args)

Generate a unique hash for the given inputs.

Args

*args
Values to be combined into a single unique hash.

Returns

A unique hash derived from the inputs.
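One way such a helper could be built is by joining the string form of each argument and hashing the result. The function below is a hypothetical sketch (the name, separator, and choice of SHA-256 are all assumptions, not the library's actual implementation):

```python
import hashlib


def gen_unique_hash_sketch(*args) -> str:
    # Hypothetical sketch: join the string form of each argument with a
    # separator, then hash the combined string with SHA-256.
    joined = "|".join(str(a) for a in args)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

The same inputs always yield the same digest, while differing inputs are (with overwhelming probability) mapped to different digests.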

def generate_composite_key(*args)
def get_mkv_config() ‑> MKVClientCredentials

Get the user-created mkv_config containing the user's base URL and access token.

Returns

MKVClientCredentials config containing relevant information.

def infer_categorical_cols(df: DataFrame, min_count: int = 50, high_threshold: float = 0.9, low_threshold: float = 0.7) ‑> List

Infer categorical columns in the dataset using the heuristic: (num_count - num_unique_count) / num_count >= threshold.

Args

df
DataFrame from the data_set segment to be analyzed for categorical columns.
min_count
Minimum number of rows required to apply the higher threshold.
high_threshold
Threshold applied when the number of rows in the DataFrame is > min_count.
low_threshold
Threshold applied when the number of samples is < min_count.

Returns

List of categorical columns.
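The heuristic above can be sketched in pandas as follows. This is a hypothetical re-implementation for illustration (the function name is invented and the real function's edge-case handling may differ):

```python
from typing import List

import pandas as pd


def infer_categorical_cols_sketch(df: pd.DataFrame,
                                  min_count: int = 50,
                                  high_threshold: float = 0.9,
                                  low_threshold: float = 0.7) -> List[str]:
    # A column is deemed categorical when most of its values repeat:
    # (num_count - num_unique_count) / num_count >= threshold
    threshold = high_threshold if len(df) > min_count else low_threshold
    categorical = []
    for col in df.columns:
        num_count = df[col].count()          # non-null values
        if num_count == 0:
            continue                         # skip all-null columns
        num_unique = df[col].nunique()
        if (num_count - num_unique) / num_count >= threshold:
            categorical.append(col)
    return categorical
```

For example, a column holding only a handful of repeated labels clears the threshold, while a column of mostly distinct values does not.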

def infer_data_type(df: DataFrame) ‑> Dict

Given a DataFrame, infer the data type of each column in the data_set.

Args

df
Input DataFrame of the data_set segment, processed to identify the right dtype of each column.

Returns

Dictionary with column names as keys and their inferred type (VisionBaseType) as values.
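A coarse version of such inference can be sketched with pandas dtype checks. This is an assumption-laden illustration: the function name is invented, and plain strings stand in for the `VisionBaseType` values the real function returns:

```python
from typing import Dict

import pandas as pd


def infer_data_type_sketch(df: pd.DataFrame) -> Dict[str, str]:
    # Map each column to a coarse type label based on its pandas dtype.
    # String labels stand in for VisionBaseType here (hypothetical).
    result = {}
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_bool_dtype(series):
            # check bool first: bool columns also count as numeric in pandas
            result[col] = "boolean"
        elif pd.api.types.is_numeric_dtype(series):
            result[col] = "numerical"
        else:
            result[col] = "text"
    return result
```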

def validate_upload_config(upload_config)

Validate the upload config against the server and avoid duplicates/overwrites.

Args

upload_config
Input upload config.

Classes

class MarkovDatasetUtils

Static methods

def normalize(dataframe: DataFrame, col_names: List[str], filters: List[str], replace_symbol: str = ' ')

Normalization pipeline that applies the text filters one by one.

>>> import markov
>>> markov.dataset_utils.normalize(dataframe=dataframe, col_names=["x", "y"], filters=["url", "email"])
<dataframe>
class PresignedUrlParser (url)

>>> s = PresignedUrlParser("s3://bucket/hello/world.csv?qwe1=3#ddd")
>>> s.bucket
'bucket'
>>> s.filename
'world.csv'
>>> s.key
'hello/world.csv?qwe1=3#ddd'
>>> s.url
's3://bucket/hello/world.csv?qwe1=3#ddd'

Instance variables

prop bucket
prop filename
prop key
prop url
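The documented behavior can be reproduced with `urllib.parse`. The sketch below is a hypothetical re-implementation (class name invented, not the library's code); note that `key` keeps the query string and fragment, matching the example usage:

```python
import os
from urllib.parse import urlparse


class PresignedUrlParserSketch:
    """Hypothetical sketch of a presigned-URL parser for s3:// URLs."""

    def __init__(self, url: str):
        self._url = url
        parsed = urlparse(url)
        self._bucket = parsed.netloc
        # key is everything after "<bucket>/", including query and fragment
        self._key = url.split(f"{parsed.netloc}/", 1)[1]
        # filename is the last path component, without query or fragment
        self._filename = os.path.basename(parsed.path)

    @property
    def url(self) -> str:
        return self._url

    @property
    def bucket(self) -> str:
        return self._bucket

    @property
    def key(self) -> str:
        return self._key

    @property
    def filename(self) -> str:
        return self._filename
```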
class TextFilter (value, names=None, *, module=None, qualname=None, type=None, start=1)

An enumeration.

Ancestors

  • builtins.str
  • enum.Enum

Class variables

var BRACKET : Final
var CURRENCY : Final
var EMAIL : Final
var EMOJI : Final
var HTML : Final
var MENTION_AT : Final
var MENTION_HASH : Final
var MULTI_WHITESPACE : Final
var NEW_LINE : Final
var NUMBERS : Final
var PHONE : Final
var SPECIAL_CHARS : Final
var URL : Final

Static methods

def has_member_key(key)
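Because `TextFilter` subclasses both `str` and `Enum`, a membership helper like `has_member_key` can be implemented against the enum's member names. The sketch below is an assumption: the member values shown and the case-insensitive lookup are invented for illustration, not taken from the library:

```python
from enum import Enum


class TextFilterSketch(str, Enum):
    # Two hypothetical members; the real enum defines many more
    # (URL, EMAIL, EMOJI, HTML, ...) with values not documented here.
    URL = "url"
    EMAIL = "email"

    @classmethod
    def has_member_key(cls, key: str) -> bool:
        # Case-insensitive check against member names (assumed behavior)
        return key.upper() in cls.__members__
```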