dicee

DICE Embeddings - Knowledge Graph Embedding Library.

A library for training and using knowledge graph embedding models with support for various scoring techniques and training strategies.

Submodules:

  • evaluation: Model evaluation functions and the Evaluator class

  • models: KGE model implementations

  • trainer: Training orchestration

  • scripts: Utility scripts

Attributes

__version__

Classes

Execute

Executor class for training, retraining and evaluating KGE models.

KGE

Knowledge Graph Embedding Class for interactive usage of pre-trained models

QueryGenerator

DICE_Trainer

DICE_Trainer implements PyTorch Lightning, multi-GPU (DDP), and CPU training.

Evaluator

Evaluator class for KGE models in various downstream tasks.

Package Contents

class dicee.Execute(args, continuous_training: bool = False)

Executor class for training, retraining and evaluating KGE models.

Handles the complete workflow:

  1. Loading, preprocessing, and serializing input data.

  2. Training, validation, and testing.

  3. Storing all necessary information.

args

Processed input arguments.

distributed

Whether distributed training is enabled.

rank

Process rank in distributed training.

world_size

Total number of processes.

local_rank

Local GPU rank.

trainer

Training handler instance.

trained_model

The trained model after training completes.

knowledge_graph

The loaded knowledge graph.

report

Dictionary storing training metrics and results.

evaluator

Model evaluation handler.

distributed
args
is_continual_training = False
trainer: dicee.trainer.DICE_Trainer | None = None
trained_model = None
knowledge_graph: dicee.knowledge_graph.KG | None = None
report: Dict
evaluator: dicee.evaluator.Evaluator | None = None
start_time: float | None = None
is_rank_zero() bool
cleanup()
setup_executor() None

Set up storage directories for the experiment.

Creates or reuses experiment directories based on configuration. Saves the configuration to a JSON file.

create_and_store_kg() None

Create knowledge graph and store as memory-mapped file.

Only executed on rank 0 in distributed training. Skips if memmap already exists.

load_from_memmap() None

Load knowledge graph from memory-mapped file.

save_trained_model() None

Save a trained knowledge graph embedding model.

  1. Put the model in eval mode and move it to CPU.

  2. Store the memory footprint of the model.

  3. Save the model to disk.

  4. Update the KG statistics.

Returns:

None

end(form_of_labelling: str) dict

End training.

  1. Store the trained model.

  2. Report runtimes.

  3. Evaluate the model if required.

Returns:

A dict containing information about the training and/or evaluation

write_report() None

Write training-related information to a report.json file.

start() dict

Start training.

  1. Load the data.

  2. Create an evaluator object.

  3. Create a trainer object.

  4. Start the training.

Returns:

A dict containing information about the training and/or evaluation
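
Example

A minimal end-to-end sketch, assuming an argparse-style args object; the attribute names shown (dataset_dir, model, num_epochs, embedding_dim) are illustrative placeholders and must match the options accepted by the library's argument parser.

>>> from argparse import Namespace
>>> from dicee import Execute
>>> args = Namespace(dataset_dir="KGs/UMLS", model="Keci",   # placeholder settings
...                  num_epochs=10, embedding_dim=32)
>>> report = Execute(args).start()  # dict with training/evaluation information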

class dicee.KGE(path=None, url=None, construct_ensemble=False, model_name=None)

Bases: dicee.abstracts.BaseInteractiveKGE, dicee.abstracts.InteractiveQueryDecomposition, dicee.abstracts.BaseInteractiveTrainKGE

Knowledge Graph Embedding Class for interactive usage of pre-trained models
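
Example

A minimal loading sketch; the experiment path below is a placeholder for a directory produced by a previous training run.

>>> from dicee import KGE
>>> pre_trained_kge = KGE(path="Experiments/2024-01-01_12-00-00")  # placeholder path
>>> print(pre_trained_kge)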

__str__()
to(device: str) None
get_transductive_entity_embeddings(indices: torch.LongTensor | List[str], as_pytorch=False, as_numpy=False, as_list=True) torch.FloatTensor | numpy.ndarray | List[float]
create_vector_database(collection_name: str, distance: str, location: str = 'localhost', port: int = 6333)
generate(h='', r='')
eval_lp_performance(dataset=List[Tuple[str, str, str]], filtered=True)
predict_missing_head_entity(relation: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple

Given a relation and a tail entity, return top k ranked head entity.

argmax_{e in E } f(e,r,t), where r in R, t in E.

Parameter

relation: Union[List[str], str]

String representation of selected relations.

tail_entity: Union[List[str], str]

String representation of selected entities.

topk: int

Number of highest-ranked entities to return.

Returns: Tuple

Top-k scores and the corresponding head entities
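
Example

A hedged sketch; the relation and tail-entity labels are placeholders that must exist in the model's vocabulary, and pre_trained_kge is a loaded KGE instance as above.

>>> heads = pre_trained_kge.predict_missing_head_entity(
...     relation=["bornIn"], tail_entity=["Paris"], topk=5)  # placeholder labels
>>> print(heads)  # per the docstring, a tuple of top-k scores and entities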

predict_missing_relations(head_entity: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple

Given a head entity and a tail entity, return top k ranked relations.

argmax_{r in R } f(h,r,t), where h, t in E.

Parameter

head_entity: List[str]

String representation of selected entities.

tail_entity: List[str]

String representation of selected entities.

topk: int

Number of highest-ranked relations to return.

Returns: Tuple

Top-k scores and the corresponding relations

predict_missing_tail_entity(head_entity: List[str] | str, relation: List[str] | str, within: List[str] = None, batch_size=2, topk=1, return_indices=False) torch.FloatTensor

Given a head entity and a relation, return top k ranked entities

argmax_{e in E } f(h,r,e), where h in E and r in R.

Parameter

head_entity: List[str]

String representation of selected entities.

relation: List[str]

String representation of selected relations.

Returns: Tuple

scores

predict(*, h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, within=None, logits=True) torch.FloatTensor
Parameters:
  • logits

  • h

  • r

  • t

  • within

predict_topk(*, h: str | List[str] = None, r: str | List[str] = None, t: str | List[str] = None, topk: int = 10, within: List[str] = None, batch_size: int = 1024)

Predict missing item in a given triple.

Returns:

  • If you query a single (h, r, ?) or (?, r, t) or (h, ?, t), returns List[(item, score)]

  • If you query a batch of B, returns List of B such lists.
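
Example

A sketch of a single (h, r, ?) query; the labels are placeholders and pre_trained_kge is a loaded KGE instance.

>>> answers = pre_trained_kge.predict_topk(h="barack_obama", r="bornIn", topk=3)
>>> for item, score in answers:  # list of (item, score) pairs for a single query
...     print(item, score)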

triple_score(h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, logits=False) torch.FloatTensor

Predict triple score

Parameter

head_entity: List[str]

String representation of selected entities.

relation: List[str]

String representation of selected relations.

tail_entity: List[str]

String representation of selected entities.

logits: bool

If True, the unnormalized score is returned.

Returns: torch.FloatTensor

PyTorch tensor of triple scores
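
Example

A sketch with placeholder labels; pre_trained_kge is a loaded KGE instance.

>>> score = pre_trained_kge.triple_score(
...     h=["barack_obama"], r=["bornIn"], t=["hawaii"], logits=False)  # placeholder labels
>>> print(score)  # torch.FloatTensor with one score per triple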

return_multi_hop_query_results(aggregated_query_for_all_entities, k: int, only_scores)
single_hop_query_answering(query: tuple, only_scores: bool = True, k: int = None)
answer_multi_hop_query(query_type: str = None, query: Tuple[str | Tuple[str, str], Ellipsis] = None, queries: List[Tuple[str | Tuple[str, str], Ellipsis]] = None, tnorm: str = 'prod', neg_norm: str = 'standard', lambda_: float = 0.0, k: int = 10, only_scores=False) List[Tuple[str, torch.Tensor]]

# @TODO: Refactoring is needed # @TODO: Score computation for each query type should be done in a static function

Find an answer set for EPFO queries including negation and disjunction

Parameter

query_type: str The type of the query, e.g., “2p”.

query: Union[str, Tuple[str, Tuple[str, str]]] The query itself, either a string or a nested tuple.

queries: List of Tuple[Union[str, Tuple[str, str]], …]

tnorm: str The t-norm operator.

neg_norm: str The negation norm.

lambda_: float The lambda parameter for the Sugeno and Yager negation norms.

k: int The top-k substitutions for intermediate variables.

Returns:
  • List[Tuple[str, torch.Tensor]]

  • Entities and corresponding scores, sorted in descending order of score
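
Example

A sketch of a 2p (two-hop projection) query; the anchor entity and relation labels are placeholders, and the (anchor, (relation_1, relation_2)) layout is assumed from the nested-tuple query format described above.

>>> answers = pre_trained_kge.answer_multi_hop_query(
...     query_type="2p",
...     query=("barack_obama", ("bornIn", "locatedIn")),  # placeholder labels
...     tnorm="prod", k=10)
>>> for entity, score in answers:  # sorted by descending score
...     print(entity, score)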

find_missing_triples(confidence: float, entities: List[str] = None, relations: List[str] = None, topk: int = 10, at_most: int = sys.maxsize) Set

Find missing triples.

Iterate over a set of entities E and a set of relations R: for all e in E and for all r in R, compute f(e,r,x).

Return every (e,r,x) not in G with f(e,r,x) > confidence.

confidence: float

A threshold on the output of a sigmoid function given a triple.

topk: int

Number of highest-ranked items used to select triples with f(e,r,x) > confidence.

at_most: int

Stop after finding at_most missing triples.

Returns: Set

{(e,r,x) | f(e,r,x) > confidence and (e,r,x) not in G}
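
Example

A sketch with placeholder entity and relation labels; lowering confidence yields more, but noisier, candidate triples.

>>> missing = pre_trained_kge.find_missing_triples(
...     confidence=0.95,
...     entities=["barack_obama", "hawaii"],  # placeholder labels
...     relations=["bornIn"],
...     topk=10, at_most=100)
>>> print(missing)  # a set of candidate triples not present in G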

predict_literals(entity: List[str] | str = None, attribute: List[str] | str = None, denormalize_preds: bool = True) numpy.ndarray

Predicts literal values for given entities and attributes.

Parameters:
  • entity (Union[List[str], str]) – Entity or list of entities to predict literals for.

  • attribute (Union[List[str], str]) – Attribute or list of attributes to predict literals for.

  • denormalize_preds (bool) – If True, denormalizes the predictions.

Returns:

Predictions for the given entities and attributes.

Return type:

numpy.ndarray
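
Example

A sketch with placeholder entity and attribute labels; this assumes the model was trained with literal (attribute) data.

>>> preds = pre_trained_kge.predict_literals(
...     entity=["barack_obama"], attribute=["height"],  # placeholder labels
...     denormalize_preds=True)
>>> print(preds)  # numpy.ndarray of predicted literal values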

class dicee.QueryGenerator(train_path, val_path: str, test_path: str, ent2id: Dict = None, rel2id: Dict = None, seed: int = 1, gen_valid: bool = False, gen_test: bool = True)
train_path
val_path
test_path
gen_valid = False
gen_test = True
seed = 1
max_ans_num = 1000000.0
mode
ent2id = None
rel2id: Dict = None
ent_in: Dict
ent_out: Dict
query_name_to_struct
list2tuple(list_data)
tuple2list(x: List | Tuple) List | Tuple

Convert a nested tuple to a nested list.

set_global_seed(seed: int)

Set seed

construct_graph(paths: List[str]) Tuple[Dict, Dict]

Construct a graph from triples. Returns dicts with incoming and outgoing edges.

fill_query(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, answer: int) bool

Private method for fill_query logic.

achieve_answer(query: List[str | List], ent_in: Dict, ent_out: Dict) set

Private method for achieve_answer logic. @TODO: Document the code

ground_queries(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, small_ent_in: Dict, small_ent_out: Dict, gen_num: int, query_name: str)

Generate queries and obtain their answers.

unmap(query_type, queries, tp_answers, fp_answers, fn_answers)
unmap_query(query_structure, query, id2ent, id2rel)
generate_queries(query_struct: List, gen_num: int, query_type: str)

Pass incoming and outgoing edges to ground queries depending on the mode [train, valid, or test] and return queries and answers. @TODO: create a class for each single query struct

save_queries(query_type: str, gen_num: int, save_path: str)
abstract load_queries(path)
get_queries(query_type: str, gen_num: int)
static save_queries_and_answers(path: str, data: List[Tuple[str, Tuple[collections.defaultdict]]]) None

Save queries to disk

static load_queries_and_answers(path: str) List[Tuple[str, Tuple[collections.defaultdict]]]

Load Queries from Disk to Memory
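
Example

A sketch of generating and saving multi-hop queries; the file paths are placeholders for tab-separated triple files.

>>> from dicee import QueryGenerator
>>> qg = QueryGenerator(train_path="KGs/UMLS/train.txt",  # placeholder paths
...                     val_path="KGs/UMLS/valid.txt",
...                     test_path="KGs/UMLS/test.txt", seed=1)
>>> qg.save_queries(query_type="2p", gen_num=100, save_path="Queries")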

class dicee.DICE_Trainer(args, is_continual_training: bool, storage_path, evaluator=None)
DICE_Trainer implements:

  1. PyTorch Lightning trainer (https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html)

  2. Multi-GPU trainer (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)

  3. CPU trainer

args

is_continual_training: bool

storage_path: str

evaluator

report: dict

report
args
trainer = None
is_continual_training
storage_path
evaluator = None
form_of_labelling = None
continual_start(knowledge_graph)
  1. Initialize training.

  2. Load the model.

  3. Load the trainer.

  4. Fit the model.

Returns:
  • model

  • form_of_labelling (str)

initialize_trainer(callbacks: List) lightning.Trainer | dicee.trainer.model_parallelism.TensorParallel | dicee.trainer.torch_trainer.TorchTrainer | dicee.trainer.torch_trainer_ddp.TorchDDPTrainer

Initialize Trainer from input arguments

initialize_or_load_model()
init_dataloader(dataset: torch.utils.data.Dataset) torch.utils.data.DataLoader
init_dataset() torch.utils.data.Dataset
start(knowledge_graph: dicee.knowledge_graph.KG | numpy.memmap) Tuple[dicee.models.base_model.BaseKGE, str]

Start the training

  1. Initialize Trainer

  2. Initialize or load a pretrained KGE model

In a DDP setup, the memory map of the already read/indexed KG needs to be loaded.
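
Example

A sketch of the lower-level training API; DICE_Trainer is normally driven by Execute, so args and knowledge_graph are assumed to come from an Execute-style setup, and the storage path is a placeholder.

>>> from dicee import DICE_Trainer
>>> trainer = DICE_Trainer(args=args, is_continual_training=False,
...                        storage_path="Experiments/demo")  # placeholder path
>>> trained_model, form_of_labelling = trainer.start(knowledge_graph)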

k_fold_cross_validation(dataset) Tuple[dicee.models.base_model.BaseKGE, str]

Perform K-fold Cross-Validation

  1. Obtain K train and test splits.

  2. For each split,

    2.1. Initialize the trainer and model.

    2.2. Train the model with the configuration provided in args.

    2.3. Compute the mean reciprocal rank (MRR) of the model on the respective test split.

  3. Report the average MRR across the K splits.

Parameters:
  • self

  • dataset

Returns:

model

class dicee.Evaluator(args, is_continual_training: bool = False)

Evaluator class for KGE models in various downstream tasks.

Orchestrates link prediction evaluation with different scoring techniques including standard evaluation and byte-pair encoding based evaluation.

er_vocab

Entity-relation to tail vocabulary for filtered ranking.

re_vocab

Relation-entity (tail) to head vocabulary.

ee_vocab

Entity-entity to relation vocabulary.

num_entities

Total number of entities in the knowledge graph.

num_relations

Total number of relations in the knowledge graph.

args

Configuration arguments.

report

Dictionary storing evaluation results.

during_training

Whether evaluation is happening during training.

Example

>>> from dicee.evaluation import Evaluator
>>> evaluator = Evaluator(args)
>>> results = evaluator.eval(dataset, model, 'EntityPrediction')
>>> print(f"Test MRR: {results['Test']['MRR']:.4f}")
re_vocab: Dict | None = None
er_vocab: Dict | None = None
ee_vocab: Dict | None = None
func_triple_to_bpe_representation = None
is_continual_training = False
num_entities: int | None = None
num_relations: int | None = None
domain_constraints_per_rel = None
range_constraints_per_rel = None
args
report: Dict
during_training = False
vocab_preparation(dataset) None

Prepare vocabularies from the dataset for evaluation.

Resolves any future objects and saves vocabularies to disk.

Parameters:

dataset – Knowledge graph dataset with vocabulary attributes.

eval(dataset, trained_model, form_of_labelling: str, during_training: bool = False) Dict | None

Evaluate the trained model on the dataset.

Parameters:
  • dataset – Knowledge graph dataset (KG instance).

  • trained_model – The trained KGE model.

  • form_of_labelling – Type of labelling (‘EntityPrediction’ or ‘RelationPrediction’).

  • during_training – Whether evaluation is during training.

Returns:

Dictionary of evaluation metrics, or None if evaluation is skipped.

eval_rank_of_head_and_tail_entity(*, train_set, valid_set=None, test_set=None, trained_model) None

Evaluate with negative sampling scoring.

eval_rank_of_head_and_tail_byte_pair_encoded_entity(*, train_set=None, valid_set=None, test_set=None, ordered_bpe_entities, trained_model) None

Evaluate with BPE-encoded entities and negative sampling.

eval_with_byte(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None

Evaluate BytE model with generation.

eval_with_bpe_vs_all(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None

Evaluate with BPE and KvsAll scoring.

eval_with_vs_all(*, train_set, valid_set=None, test_set=None, trained_model, form_of_labelling) None

Evaluate with KvsAll or 1vsAll scoring.

evaluate_lp_k_vs_all(model, triple_idx, info: str = None, form_of_labelling: str = None) Dict[str, float]

Filtered link prediction evaluation with KvsAll scoring.

Parameters:
  • model – The trained model to evaluate.

  • triple_idx – Integer-indexed test triples.

  • info – Description to print.

  • form_of_labelling – ‘EntityPrediction’ or ‘RelationPrediction’.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

evaluate_lp_with_byte(model, triples: List[List[str]], info: str = None) Dict[str, float]

Evaluate BytE model with text generation.

Parameters:
  • model – BytE model.

  • triples – String triples.

  • info – Description to print.

Returns:

Dictionary with placeholder metrics (-1 values).

evaluate_lp_bpe_k_vs_all(model, triples: List[List[str]], info: str = None, form_of_labelling: str = None) Dict[str, float]

Evaluate BPE model with KvsAll scoring.

Parameters:
  • model – BPE-enabled model.

  • triples – String triples.

  • info – Description to print.

  • form_of_labelling – Type of labelling.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

evaluate_lp(model, triple_idx, info: str) Dict[str, float]

Evaluate link prediction with negative sampling.

Parameters:
  • model – The model to evaluate.

  • triple_idx – Integer-indexed triples.

  • info – Description to print.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

dummy_eval(trained_model, form_of_labelling: str) None

Run evaluation from saved data (for continual training).

Parameters:
  • trained_model – The trained model.

  • form_of_labelling – Type of labelling.

eval_with_data(dataset, trained_model, triple_idx: numpy.ndarray, form_of_labelling: str) Dict[str, float]

Evaluate a trained model on a given dataset.

Parameters:
  • dataset – Knowledge graph dataset.

  • trained_model – The trained model.

  • triple_idx – Integer-indexed triples to evaluate.

  • form_of_labelling – Type of labelling.

Returns:

Dictionary with evaluation metrics.

Raises:

ValueError – If scoring technique is invalid.

dicee.__version__ = '0.1.5'