dicee.read_preprocess_save_load_kg.util

Functions

polars_dataframe_indexer(→ polars.DataFrame)

Replaces 'subject', 'relation', and 'object' columns in the input Polars DataFrame with their corresponding index values

pandas_dataframe_indexer(→ pandas.DataFrame)

Replaces 'subject', 'relation', and 'object' columns in the input Pandas DataFrame with their corresponding index values

apply_reciprocal_or_noise(add_reciprocal, eval_model)

Add reciprocal triples if conditions are met

timeit(func)

read_with_polars(→ polars.DataFrame)

Load and Preprocess via Polars

read_with_pandas(data_path[, read_only_few, ...])

Load and Preprocess via Pandas

read_from_disk(→ Tuple[polars.DataFrame, pandas.DataFrame])

count_triples(→ int)

Returns the total number of triples in the triple store.

fetch_worker(endpoint, offsets, chunk_size, ...)

Worker process: fetch assigned chunks and save to disk with per-worker tqdm.

read_from_triple_store_with_polars(endpoint[, ...])

Main function to read all triples in parallel, save as Parquet, and load into Polars dataframe.

read_from_triple_store_with_pandas([endpoint])

Read triples from triple store into pandas dataframe

get_er_vocab(data[, file_path])

get_re_vocab(data[, file_path])

get_ee_vocab(data[, file_path])

create_constraints(triples[, file_path])

load_with_pandas(→ None)

Deserialize data

save_numpy_ndarray(*, data, file_path)

load_numpy_ndarray(*, file_path)

save_pickle(*, data[, file_path])

load_pickle(*[, file_path])

create_recipriocal_triples(x)

Add inverse triples into a Dask DataFrame

dataset_sanity_checking(→ None)

Module Contents

dicee.read_preprocess_save_load_kg.util.polars_dataframe_indexer(df_polars: polars.DataFrame, idx_entity: polars.DataFrame, idx_relation: polars.DataFrame) → polars.DataFrame

Replaces ‘subject’, ‘relation’, and ‘object’ columns in the input Polars DataFrame with their corresponding index values from the entity and relation index DataFrames.

This function processes the DataFrame in three main steps:

  1. Replace the ‘relation’ values with the corresponding index from idx_relation.

  2. Replace the ‘subject’ values with the corresponding index from idx_entity.

  3. Replace the ‘object’ values with the corresponding index from idx_entity.

Parameters:

df_polars : polars.DataFrame

The input Polars DataFrame containing columns: ‘subject’, ‘relation’, and ‘object’.

idx_entity : polars.DataFrame

A Polars DataFrame that contains the mapping between entity names and their corresponding indices. Must have columns: ‘entity’ and ‘index’.

idx_relation : polars.DataFrame

A Polars DataFrame that contains the mapping between relation names and their corresponding indices. Must have columns: ‘relation’ and ‘index’.

Returns:

polars.DataFrame

A DataFrame with the ‘subject’, ‘relation’, and ‘object’ columns replaced by their corresponding indices.

Example Usage:

>>> df_polars = pl.DataFrame({
...     "subject": ["Alice", "Bob", "Charlie"],
...     "relation": ["knows", "works_with", "lives_in"],
...     "object": ["Dave", "Eve", "Frank"]
... })
>>> idx_entity = pl.DataFrame({
...     "entity": ["Alice", "Bob", "Charlie", "Dave", "Eve", "Frank"],
...     "index": [0, 1, 2, 3, 4, 5]
... })
>>> idx_relation = pl.DataFrame({
...     "relation": ["knows", "works_with", "lives_in"],
...     "index": [0, 1, 2]
... })
>>> polars_dataframe_indexer(df_polars, idx_entity, idx_relation)

Steps:

  1. Join the input DataFrame df_polars on the ‘relation’ column with idx_relation to replace the relations with their indices.

  2. Join on ‘subject’ to replace it with the corresponding entity index using a left join on idx_entity.

  3. Join on ‘object’ to replace it with the corresponding entity index using a left join on idx_entity.

  4. Select only the ‘subject’, ‘relation’, and ‘object’ columns to return the final result.

dicee.read_preprocess_save_load_kg.util.pandas_dataframe_indexer(df_pandas: pandas.DataFrame, idx_entity: pandas.DataFrame, idx_relation: pandas.DataFrame) → pandas.DataFrame

Replaces ‘subject’, ‘relation’, and ‘object’ columns in the input Pandas DataFrame with their corresponding index values from the entity and relation index DataFrames.

Parameters:

df_pandas : pd.DataFrame

The input Pandas DataFrame containing columns: ‘subject’, ‘relation’, and ‘object’.

idx_entity : pd.DataFrame

A Pandas DataFrame that contains the mapping between entity names and their corresponding indices. Must have columns: ‘entity’ and ‘index’.

idx_relation : pd.DataFrame

A Pandas DataFrame that contains the mapping between relation names and their corresponding indices. Must have columns: ‘relation’ and ‘index’.

Returns:

pd.DataFrame

A DataFrame with the ‘subject’, ‘relation’, and ‘object’ columns replaced by their corresponding indices.
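The same replacement can be sketched in Pandas with dictionary lookups via Series.map. This is an illustrative sketch; index_triples_pandas is a hypothetical helper, not the library function.

```python
import pandas as pd

def index_triples_pandas(df: pd.DataFrame, idx_entity: pd.DataFrame,
                         idx_relation: pd.DataFrame) -> pd.DataFrame:
    # Build name -> index lookup tables once from the index DataFrames.
    entity_to_idx = dict(zip(idx_entity["entity"], idx_entity["index"]))
    relation_to_idx = dict(zip(idx_relation["relation"], idx_relation["index"]))
    out = df.copy()
    # Map each column to its integer index; unknown names become NaN.
    out["subject"] = out["subject"].map(entity_to_idx)
    out["relation"] = out["relation"].map(relation_to_idx)
    out["object"] = out["object"].map(entity_to_idx)
    return out
```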

dicee.read_preprocess_save_load_kg.util.apply_reciprocal_or_noise(add_reciprocal: bool, eval_model: str, df: object = None, info: str = None)

Add reciprocal triples if conditions are met

dicee.read_preprocess_save_load_kg.util.timeit(func)
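timeit carries no docstring; a conventional implementation of such a timing decorator (a sketch, not necessarily the library's exact code) looks like:

```python
import functools
import time

def timeit(func):
    """Decorator: report how long the wrapped function took to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Report the wall-clock time of this call.
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timeit
def add(a, b):
    return a + b
```

functools.wraps preserves the wrapped function's name and docstring, so decorated functions still introspect correctly.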
dicee.read_preprocess_save_load_kg.util.read_with_polars(data_path, read_only_few: int = None, sample_triples_ratio: float = None, separator: str = None) → polars.DataFrame

Load and Preprocess via Polars

dicee.read_preprocess_save_load_kg.util.read_with_pandas(data_path, read_only_few: int = None, sample_triples_ratio: float = None, separator: str = None)

Load and Preprocess via Pandas
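A minimal sketch of such a loader, assuming whitespace-separated triple files; read_triples is a hypothetical stand-in, not the library function:

```python
import io

import pandas as pd

def read_triples(source, read_only_few=None, sample_triples_ratio=None,
                 separator=r"\s+"):
    # Load separator-delimited triples; nrows caps how many rows are read.
    df = pd.read_csv(source, sep=separator, header=None,
                     names=["subject", "relation", "object"],
                     nrows=read_only_few, engine="python")
    # Optionally keep only a random fraction of the triples.
    if sample_triples_ratio is not None:
        df = df.sample(frac=sample_triples_ratio, random_state=1)
    return df

# Example: parse two whitespace-separated triples from an in-memory file.
triples = read_triples(io.StringIO("a r b\nc s d\n"))
```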

dicee.read_preprocess_save_load_kg.util.read_from_disk(data_path: str, read_only_few: int = None, sample_triples_ratio: float = None, backend: str = None, separator: str = None) → Tuple[polars.DataFrame, pandas.DataFrame]
dicee.read_preprocess_save_load_kg.util.count_triples(endpoint: str) → int

Returns the total number of triples in the triple store.
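Counting can be done with a single SPARQL aggregate query. In the sketch below, COUNT_QUERY and parse_count are illustrative names, and the endpoint is assumed to accept SPARQL 1.1 Protocol requests over HTTP GET with JSON results:

```python
import json
import urllib.parse
import urllib.request

# Standard SPARQL aggregate query for the size of a triple store.
COUNT_QUERY = "SELECT (COUNT(*) AS ?cnt) WHERE { ?s ?p ?o }"

def parse_count(results: dict) -> int:
    # Pull the single binding out of a SPARQL JSON result set.
    return int(results["results"]["bindings"][0]["cnt"]["value"])

def count_triples(endpoint: str) -> int:
    # Send the query to the endpoint, asking for JSON results.
    params = urllib.parse.urlencode({
        "query": COUNT_QUERY,
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{endpoint}?{params}") as response:
        return parse_count(json.load(response))
```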

dicee.read_preprocess_save_load_kg.util.fetch_worker(endpoint: str, offsets: list[int], chunk_size: int, output_dir: str, worker_id: int)

Worker process: fetch assigned chunks and save to disk with per-worker tqdm.

dicee.read_preprocess_save_load_kg.util.read_from_triple_store_with_polars(endpoint: str, chunk_size: int = 500000, output_dir: str = 'triples_parquet')

Main function to read all triples in parallel, save as Parquet, and load into Polars dataframe.

dicee.read_preprocess_save_load_kg.util.read_from_triple_store_with_pandas(endpoint: str = None)

Read triples from triple store into pandas dataframe

dicee.read_preprocess_save_load_kg.util.get_er_vocab(data, file_path: str = None)
dicee.read_preprocess_save_load_kg.util.get_re_vocab(data, file_path: str = None)
dicee.read_preprocess_save_load_kg.util.get_ee_vocab(data, file_path: str = None)
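These vocabulary helpers are conventionally used for filtered link-prediction evaluation. A sketch of get_er_vocab under that assumption (get_re_vocab and get_ee_vocab would group by (relation, tail) and (head, tail) analogously):

```python
from collections import defaultdict

def get_er_vocab(data):
    # Map each (head entity, relation) pair to the list of tail entities
    # observed with it in the dataset.
    er_vocab = defaultdict(list)
    for head, relation, tail in data:
        er_vocab[(head, relation)].append(tail)
    return er_vocab
```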
dicee.read_preprocess_save_load_kg.util.create_constraints(triples, file_path: str = None)
  1. Extract the domains and ranges of relations.

  2. Store a mapping from each relation to the entities that fall outside of its domain and range, i.e. create constrained entities based on the range of each relation.

Returns:

Tuple[dict, dict]
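The two steps can be sketched as follows; this is an illustrative implementation, not the library's exact code:

```python
from collections import defaultdict

def create_constraints(triples):
    # Step 1: collect the observed domain (head entities) and range
    # (tail entities) of every relation.
    domain = defaultdict(set)
    range_ = defaultdict(set)
    for head, relation, tail in triples:
        domain[relation].add(head)
        range_[relation].add(tail)
    entities = set().union(*domain.values()) | set().union(*range_.values())
    # Step 2: for each relation, record the entities that fall OUTSIDE
    # its observed domain / range.
    domain_constraints = {r: entities - heads for r, heads in domain.items()}
    range_constraints = {r: entities - tails for r, tails in range_.items()}
    return domain_constraints, range_constraints
```

Such out-of-domain/out-of-range sets are commonly used to draw harder negative samples or to constrain candidate entities during evaluation.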

dicee.read_preprocess_save_load_kg.util.load_with_pandas(self) → None

Deserialize data

dicee.read_preprocess_save_load_kg.util.save_numpy_ndarray(*, data: numpy.ndarray, file_path: str)
dicee.read_preprocess_save_load_kg.util.load_numpy_ndarray(*, file_path: str)
dicee.read_preprocess_save_load_kg.util.save_pickle(*, data: object, file_path=str)
dicee.read_preprocess_save_load_kg.util.load_pickle(*, file_path=str)
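The keyword-only save/load helpers can be sketched as thin wrappers around numpy.save / numpy.load and pickle. This is an assumption about the backing implementation; note also that the rendered default file_path=str in the last two signatures looks like a typo for a type annotation, so the sketch assumes a string path:

```python
import os
import pickle
import tempfile

import numpy as np

def save_numpy_ndarray(*, data: np.ndarray, file_path: str) -> None:
    with open(file_path, "wb") as f:
        np.save(f, data)

def load_numpy_ndarray(*, file_path: str) -> np.ndarray:
    with open(file_path, "rb") as f:
        return np.load(f)

def save_pickle(*, data: object, file_path: str) -> None:
    with open(file_path, "wb") as f:
        pickle.dump(data, f)

def load_pickle(*, file_path: str) -> object:
    with open(file_path, "rb") as f:
        return pickle.load(f)

# Round-trip example in a temporary directory.
tmpdir = tempfile.mkdtemp()
array_path = os.path.join(tmpdir, "arr.npy")
save_numpy_ndarray(data=np.arange(6).reshape(2, 3), file_path=array_path)
restored = load_numpy_ndarray(file_path=array_path)
```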
dicee.read_preprocess_save_load_kg.util.create_recipriocal_triples(x)

Add inverse triples into a Dask DataFrame.
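A sketch of the inverse-triple construction on a Pandas/Dask-style DataFrame. The _inverse relation suffix and the corrected spelling create_reciprocal_triples are assumptions for illustration; the library's function is named create_recipriocal_triples:

```python
import pandas as pd

def create_reciprocal_triples(df: pd.DataFrame) -> pd.DataFrame:
    # For every (s, r, o), add the inverse triple (o, r_inverse, s).
    inverse = pd.DataFrame({
        "subject": df["object"],
        "relation": df["relation"] + "_inverse",
        "object": df["subject"],
    })
    return pd.concat([df, inverse], ignore_index=True)
```

Adding reciprocal triples doubles the relation vocabulary but lets a model answer head prediction (?, r, o) as tail prediction (o, r_inverse, ?).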

dicee.read_preprocess_save_load_kg.util.dataset_sanity_checking(train_set: numpy.ndarray, num_entities: int, num_relations: int) → None

Parameters:

train_set : numpy.ndarray

The indexed training triples, one (subject, relation, object) row of integer indices per triple.

num_entities : int

Total number of entities in the knowledge graph.

num_relations : int

Total number of relations in the knowledge graph.

Returns:

None
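Given those parameters, a plausible sketch of the sanity check (behavior inferred from the signature, not the library's exact code):

```python
import numpy as np

def dataset_sanity_checking(train_set: np.ndarray, num_entities: int,
                            num_relations: int) -> None:
    # Shape check: rows of (subject, relation, object) integer indices.
    assert train_set.ndim == 2 and train_set.shape[1] == 3
    # All entity indices (columns 0 and 2) must fall in [0, num_entities).
    assert train_set[:, [0, 2]].min() >= 0
    assert train_set[:, [0, 2]].max() < num_entities
    # All relation indices (column 1) must fall in [0, num_relations).
    assert train_set[:, 1].min() >= 0
    assert train_set[:, 1].max() < num_relations
```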