Module markov.api.utils.common_utils
Functions
def create_df_config(location, name)
-
Create a configuration file for a new data_set family.
Args
location
- Valid location where this file is to be created on user system.
name
- User specified name of this configuration file.
Returns
Full path of the generated configuration file, created at the specified location if successful.
def create_ds_config(location, name)
-
Create the dataset configuration file at the user provided location on their system with given name.
Args
location
- Valid location on the user system.
name
- Name of the file.
Returns
Full path of the generated configuration file, created at the specified location if possible.
def create_upload_config(location, name)
-
Create config file to upload dataset to specified location.
Args
location
- Valid location on the user system where file has to be generated.
name
- User specified custom name of the configuration file.
Returns
Valid full path name of the configuration file.
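The three create_*_config helpers above share one contract: take a location and a file name, generate the config file there, and return its full path. A minimal sketch of that contract, assuming a JSON payload; the helper name and payload shape here are illustrative, not the library's actual implementation:

```python
import json
import os


def create_config(location: str, name: str, payload: dict) -> str:
    """Illustrative helper: write `payload` as JSON to `location/name`
    and return the full path of the generated file."""
    os.makedirs(location, exist_ok=True)
    full_path = os.path.join(location, name)
    with open(full_path, "w") as fh:
        json.dump(payload, fh, indent=2)
    return full_path
```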
def gen_text_hash(series: pd.Series)
def gen_unique_hash(*args)
-
Generate a unique hash for the given inputs.
Args
*args
- Input values to hash.
Returns
A unique hash of the inputs.
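A common way to implement such a helper is to serialize every argument and feed it into a digest. This sketch uses SHA-256 with a byte separator between arguments; the algorithm is an assumption, not the library's actual implementation:

```python
import hashlib


def gen_unique_hash(*args) -> str:
    """Return a deterministic hex digest, unique per input tuple."""
    h = hashlib.sha256()
    for arg in args:
        h.update(repr(arg).encode("utf-8"))
        # Separator byte so ("ab",) and ("a", "b") hash differently.
        h.update(b"\x1f")
    return h.hexdigest()
```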
def generate_composite_key(*args)
def get_mkv_config() ‑> MKVClientCredentials
-
Get the user created mkv_config containing information about user's base url and access token.
Returns
MKVClientCredentials config containing relevant information.
def infer_categorical_cols(df: DataFrame, min_count: int = 50, high_threshold: float = 0.9, low_threshold: float = 0.7) ‑> List
-
Infer categorical columns in the dataset based on the heuristic: (num_count - num_unique_count) / num_count >= threshold
Args
df
- DataFrame of the data_set segment to be analyzed for categorical columns.
min_count
- Minimum number of rows required to apply the higher threshold.
high_threshold
- Higher threshold, applied when the number of rows in the DataFrame is > min_count.
low_threshold
- Lower threshold, applied when the number of rows is < min_count.
Returns
List of categorical columns.
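The documented heuristic can be sketched directly in pandas. This standalone version is an approximation of the described behavior, not the library's exact code:

```python
from typing import List

import pandas as pd


def infer_categorical_cols(df: pd.DataFrame, min_count: int = 50,
                           high_threshold: float = 0.9,
                           low_threshold: float = 0.7) -> List[str]:
    """Flag columns whose value-repetition ratio meets the threshold."""
    threshold = high_threshold if len(df) > min_count else low_threshold
    categorical = []
    for col in df.columns:
        num_count = df[col].count()       # non-null values
        num_unique = df[col].nunique()
        if num_count and (num_count - num_unique) / num_count >= threshold:
            categorical.append(col)
    return categorical
```

With 30 rows the lower threshold (0.7) applies; a column with only 3 distinct values has ratio (30 - 3) / 30 = 0.9 and is flagged, while a unique-id column has ratio 0 and is not.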
def infer_data_type(df: DataFrame) ‑> Dict
-
Given a DataFrame infer type of data_set.
Args
df
- Input DataFrame of the data_set segment, processed to identify the right dtype for each column.
Returns
Dictionary with column names as keys and their inferred type (VisionBaseType) as values.
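A simplified sketch of dtype inference over a DataFrame; plain string labels stand in for the library's VisionBaseType values, which are not reproduced here:

```python
from typing import Dict

import pandas as pd


def infer_data_type(df: pd.DataFrame) -> Dict[str, str]:
    """Map each column name to a coarse inferred type label."""
    result = {}
    for col in df.columns:
        dtype = df[col].dtype
        if pd.api.types.is_numeric_dtype(dtype):
            result[col] = "numerical"
        elif pd.api.types.is_datetime64_any_dtype(dtype):
            result[col] = "datetime"
        else:
            result[col] = "text"
    return result
```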
def validate_upload_config(upload_config)
-
Validate the upload config against the server and avoid duplicates/overwrites.
Args
upload_config
- Input upload config.
Classes
class MarkovDatasetUtils
-
Static methods
def normalize(dataframe: DataFrame, col_names: List[str], filters: List[str], replace_symbol: str = ' ')
-
Normalization pipeline which applies the text filters one by one
>>> import markov
>>> markov.dataset_utils.normalize(dataframe=dataframe, col_names=["x", "y"], filters=["url", "email"])
<dataframe>
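A minimal standalone sketch of such a pipeline, applying regex-based text filters one by one to the selected columns; the two filter patterns are illustrative stand-ins, not the library's actual definitions:

```python
import re
from typing import List

import pandas as pd

# Illustrative filter patterns, keyed by filter name.
FILTERS = {
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"\S+@\S+\.\S+"),
}


def normalize(dataframe: pd.DataFrame, col_names: List[str],
              filters: List[str], replace_symbol: str = " ") -> pd.DataFrame:
    """Replace every match of each named filter in the given columns."""
    out = dataframe.copy()
    for col in col_names:
        for name in filters:
            out[col] = out[col].str.replace(FILTERS[name], replace_symbol,
                                            regex=True)
    return out
```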
class PresignedUrlParser (url)
-
>>> s = PresignedUrlParser("s3://bucket/hello/world.csv?qwe1=3#ddd")
>>> s.bucket
'bucket'
>>> s.filename
'world.csv'
>>> s.key
'hello/world.csv?qwe1=3#ddd'
>>> s.url
's3://bucket/hello/world.csv?qwe1=3#ddd'
Instance variables
prop bucket
prop filename
prop key
prop url
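A minimal sketch of such a parser. In the example above, key keeps the query string and fragment while filename does not, so this version splits the raw URL by hand rather than via urllib.parse. The property set matches the documented one, but the parsing details are assumptions:

```python
class PresignedUrlParser:
    """Parse an s3:// URL into bucket, key, and filename parts."""

    def __init__(self, url: str):
        self._url = url
        # Drop the "s3://" scheme, then split bucket from the rest.
        rest = url.split("://", 1)[1]
        self._bucket, self._key = rest.split("/", 1)

    @property
    def url(self) -> str:
        return self._url

    @property
    def bucket(self) -> str:
        return self._bucket

    @property
    def key(self) -> str:
        # Keeps any query string and fragment, as in the documented example.
        return self._key

    @property
    def filename(self) -> str:
        # Strip query string and fragment, then take the last path segment.
        return self._key.split("?", 1)[0].split("#", 1)[0].rsplit("/", 1)[-1]
```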
class TextFilter (value, names=None, *, module=None, qualname=None, type=None, start=1)
-
An enumeration.
Ancestors
- builtins.str
- enum.Enum
Class variables
var BRACKET : Final
var CURRENCY : Final
var EMAIL : Final
var EMOJI : Final
var HTML : Final
var MENTION_AT : Final
var MENTION_HASH : Final
var MULTI_WHITESPACE : Final
var NEW_LINE : Final
var NUMBERS : Final
var PHONE : Final
var SPECIAL_CHARS : Final
var URL : Final
Static methods
def has_member_key(key)
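The str-backed enum pattern with a membership check can be sketched as follows; the two members shown are illustrative, and the real TextFilter defines many more:

```python
from enum import Enum


class TextFilter(str, Enum):
    """String-valued enum: members compare equal to plain strings."""
    URL = "url"
    EMAIL = "email"

    @staticmethod
    def has_member_key(key) -> bool:
        # True when `key` matches one of the member values.
        return key in {item.value for item in TextFilter}
```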