dicee

DICE Embeddings - Knowledge Graph Embedding Library.

A library for training and using knowledge graph embedding models with support for various scoring techniques and training strategies.

Submodules:

evaluation: Model evaluation functions and the Evaluator class

models: KGE model implementations

trainer: Training orchestration

scripts: Utility scripts

Attributes

__version__

Classes

Execute

Executor class for training, retraining and evaluating KGE models.

KGE

Knowledge Graph Embedding Class for interactive usage of pre-trained models

QueryGenerator

DICE_Trainer

DICE_Trainer implements PyTorch Lightning, multi-GPU (DDP), and CPU training.

Evaluator

Evaluator class for KGE models in various downstream tasks.

Package Contents

class dicee.Execute(args, continuous_training: bool = False)[source]

Executor class for training, retraining and evaluating KGE models.

Handles the complete workflow:

  1. Loading, preprocessing, and serializing input data.

  2. Training, validation, and testing.

  3. Storing all necessary information.

args

Processed input arguments.

distributed

Whether distributed training is enabled.

rank

Process rank in distributed training.

world_size

Total number of processes.

local_rank

Local GPU rank.

trainer

Training handler instance.

trained_model

The trained model after training completes.

knowledge_graph

The loaded knowledge graph.

report

Dictionary storing training metrics and results.

evaluator

Model evaluation handler.

distributed
args
is_continual_training = False
trainer: dicee.trainer.DICE_Trainer | None = None
trained_model = None
knowledge_graph: dicee.knowledge_graph.KG | None = None
report: Dict
evaluator: dicee.evaluator.Evaluator | None = None
start_time: float | None = None
is_rank_zero() bool[source]
cleanup()[source]
setup_executor() None[source]

Set up storage directories for the experiment.

Creates or reuses experiment directories based on configuration. Saves the configuration to a JSON file.

create_and_store_kg() None[source]

Create knowledge graph and store as memory-mapped file.

Only executed on rank 0 in distributed training. Skips if memmap already exists.

load_from_memmap() None[source]

Load knowledge graph from memory-mapped file.

save_trained_model() None[source]

Save the trained knowledge graph embedding model.

  1. Put the model into eval mode and move it to CPU.

  2. Record the memory footprint of the model.

  3. Save the model to disk.

  4. Update the stats of the KG again, if needed.

Return type:

None

end(form_of_labelling: str) dict[source]

End the training run.

  1. Store the trained model.

  2. Report runtimes.

  3. Evaluate the model if required.

Returns:

A dict containing information about the training and/or evaluation.

write_report() None[source]

Write training-related information to a report.json file.

start() dict[source]

Start training.

  1. Load the data.

  2. Create an evaluator object.

  3. Create a trainer object.

  4. Start the training.

Returns:

A dict containing information about the training and/or evaluation.
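
A minimal end-to-end sketch (not from the original docs): it assumes arguments are supplied via dicee.config.Namespace, and the dataset path, model name, and hyperparameter values below are hypothetical placeholders to adapt to your setup.

>>> from dicee import Execute
>>> from dicee.config import Namespace  # assumed argument container
>>> args = Namespace()
>>> args.model = 'Keci'                 # hypothetical model choice
>>> args.dataset_dir = 'KGs/UMLS'       # hypothetical dataset directory
>>> args.num_epochs = 20
>>> args.embedding_dim = 32
>>> report = Execute(args).start()      # dict with training/evaluation info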

class dicee.KGE(path=None, url=None, construct_ensemble=False, model_name=None)[source]

Bases: dicee.abstracts.BaseInteractiveKGE, dicee.abstracts.InteractiveQueryDecomposition, dicee.abstracts.BaseInteractiveTrainKGE

Knowledge Graph Embedding Class for interactive usage of pre-trained models
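
A minimal sketch of loading a pre-trained model for interactive use; the experiment path below is hypothetical.

>>> from dicee import KGE
>>> model = KGE(path='Experiments/Keci_UMLS')  # hypothetical experiment directory
>>> print(model)
>>> model.to('cpu')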

__str__()[source]
to(device: str) None[source]
get_transductive_entity_embeddings(indices: torch.LongTensor | List[str], as_pytorch=False, as_numpy=False, as_list=True) torch.FloatTensor | numpy.ndarray | List[float][source]
create_vector_database(collection_name: str, distance: str, location: str = 'localhost', port: int = 6333)[source]
generate(h='', r='')[source]
eval_lp_performance(dataset=List[Tuple[str, str, str]], filtered=True)[source]
predict_missing_head_entity(relation: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple[source]

Given a relation and a tail entity, return top k ranked head entity.

argmax_{e in E } f(e,r,t), where r in R, t in E.

Parameters

relation: Union[List[str], str]

String representation of selected relations.

tail_entity: Union[List[str], str]

String representation of selected entities.

topk: int

Number of highest-ranked head entities to return.

Returns: Tuple

Top-k scores and the corresponding head entities.
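
A short sketch, assuming the (scores, entities) return order described above; the relation and entity labels are hypothetical and must exist in the model's vocabulary.

>>> scores, heads = model.predict_missing_head_entity(
...     relation=['treats'], tail_entity=['disease_x'], topk=3)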

predict_missing_relations(head_entity: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple[source]

Given a head entity and a tail entity, return top k ranked relations.

argmax_{r in R } f(h,r,t), where h, t in E.

Parameters

head_entity: Union[List[str], str]

String representation of selected entities.

tail_entity: Union[List[str], str]

String representation of selected entities.

topk: int

Number of highest-ranked relations to return.

Returns: Tuple

Top-k scores and the corresponding relations.
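
A short sketch with hypothetical labels; returns the top-k scores and relations.

>>> scores, rels = model.predict_missing_relations(
...     head_entity=['drug_a'], tail_entity=['disease_x'], topk=3)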

predict_missing_tail_entity(head_entity: List[str] | str, relation: List[str] | str, within: List[str] = None, batch_size=2, topk=1, return_indices=False) torch.FloatTensor[source]

Given a head entity and a relation, return top k ranked entities

argmax_{e in E } f(h,r,e), where h in E and r in R.

Parameters

head_entity: Union[List[str], str]

String representation of selected entities.

relation: Union[List[str], str]

String representation of selected relations.

topk: int

Number of highest-ranked tail entities to return.

Returns:

Top-k scores for the candidate tail entities.
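
A short sketch with hypothetical labels; per the signature, the method returns a tensor of scores.

>>> scores = model.predict_missing_tail_entity(
...     head_entity=['drug_a'], relation=['treats'], topk=3)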

predict(*, h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, within=None, logits=True) torch.FloatTensor[source]
Parameters:
  • h – Head entity or entities (string or list of strings).

  • r – Relation or relations (string or list of strings).

  • t – Tail entity or entities (string or list of strings).

  • within – Optional list of entity labels restricting the candidate set.

  • logits – If True, return unnormalized scores.
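
A short sketch, assuming that omitting one of h, r, t scores all candidates for the missing slot; labels are hypothetical.

>>> tail_scores = model.predict(h=['drug_a'], r=['treats'], logits=True)      # (h, r, ?)
>>> head_scores = model.predict(r=['treats'], t=['disease_x'], logits=True)   # (?, r, t)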

predict_topk(*, h: str | List[str] = None, r: str | List[str] = None, t: str | List[str] = None, topk: int = 10, within: List[str] = None, batch_size: int = 1024)[source]

Predict missing item in a given triple.

Returns:

  • If you query a single (h, r, ?) or (?, r, t) or (h, ?, t), returns List[(item, score)]

  • If you query a batch of B, returns List of B such lists.
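
A short sketch with hypothetical labels, following the return format described above.

>>> model.predict_topk(h=['drug_a'], r=['treats'], topk=10)                       # List[(entity, score)]
>>> model.predict_topk(h=['drug_a', 'drug_b'], r=['treats', 'treats'], topk=10)   # batch of 2 such lists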

triple_score(h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, logits=False) torch.FloatTensor[source]

Predict triple score

Parameters

h: Union[List[str], str]

String representation of selected head entities.

r: Union[List[str], str]

String representation of selected relations.

t: Union[List[str], str]

String representation of selected tail entities.

logits: bool

If True, the unnormalized score is returned.

Returns: torch.FloatTensor

PyTorch tensor of triple scores.
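
A short sketch with hypothetical labels.

>>> model.triple_score(h=['drug_a'], r=['treats'], t=['disease_x'])               # normalized score
>>> model.triple_score(h=['drug_a'], r=['treats'], t=['disease_x'], logits=True)  # unnormalized score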

return_multi_hop_query_results(aggregated_query_for_all_entities, k: int, only_scores)[source]
single_hop_query_answering(query: tuple, only_scores: bool = True, k: int = None)[source]
answer_multi_hop_query(query_type: str = None, query: Tuple[str | Tuple[str, str], Ellipsis] = None, queries: List[Tuple[str | Tuple[str, str], Ellipsis]] = None, tnorm: str = 'prod', neg_norm: str = 'standard', lambda_: float = 0.0, k: int = 10, only_scores=False) List[Tuple[str, torch.Tensor]][source]

@TODO: Refactoring is needed.
@TODO: Score computation for each query type should be done in a static function.

Find an answer set for EPFO queries including negation and disjunction

Parameters

query_type: str

The type of the query, e.g., "2p".

query: Union[str, Tuple[str, Tuple[str, str]]]

The query itself, either a string or a nested tuple.

queries: List[Tuple[Union[str, Tuple[str, str]], ...]]

A batch of queries.

tnorm: str

The t-norm operator.

neg_norm: str

The negation norm.

lambda_: float

The lambda parameter for the Sugeno and Yager negation norms.

k: int

The top-k substitutions for intermediate variables.

Returns:
  • List[Tuple[str, torch.Tensor]]

  • Entities and corresponding scores, sorted in descending order of score.
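
A sketch of a two-hop ("2p") query, assuming the nested-tuple encoding (anchor entity followed by a relation chain); all labels are hypothetical.

>>> answers = model.answer_multi_hop_query(
...     query_type='2p',
...     query=('drug_a', ('treats', 'causes')),  # anchor entity, relation chain
...     tnorm='prod', k=10)
>>> answers[:3]  # [(entity, score), ...] sorted by score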

find_missing_triples(confidence: float, entities: List[str] = None, relations: List[str] = None, topk: int = 10, at_most: int = sys.maxsize) Set[source]

Find missing triples

Iterate over a set of entities E and a set of relations R:

for all e in E and for all r in R, compute f(e, r, x).

Return every (e, r, x) not in G with f(e, r, x) > confidence.

confidence: float

A threshold for an output of a sigmoid function given a triple.

topk: int

Number of highest-ranked items considered when selecting triples with f(e, r, x) > confidence.

at_most: int

Stop after finding at_most missing triples

Returns:

{(e, r, x) | f(e, r, x) > confidence and (e, r, x) not in G}
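
A short sketch; the entity and relation lists are hypothetical, and the result is a set of triples not in G whose score exceeds the confidence threshold.

>>> missing = model.find_missing_triples(
...     confidence=0.9, entities=['drug_a', 'drug_b'], relations=['treats'], topk=10)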

predict_literals(entity: List[str] | str = None, attribute: List[str] | str = None, denormalize_preds: bool = True) numpy.ndarray[source]

Predicts literal values for given entities and attributes.

Parameters:
  • entity (Union[List[str], str]) – Entity or list of entities to predict literals for.

  • attribute (Union[List[str], str]) – Attribute or list of attributes to predict literals for.

  • denormalize_preds (bool) – If True, denormalizes the predictions.

Returns:

Predictions for the given entities and attributes.

Return type:

numpy ndarray
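
A short sketch with hypothetical entity and attribute names.

>>> preds = model.predict_literals(entity=['drug_a'], attribute=['molecular_weight'])
>>> preds.shape  # numpy.ndarray of predictions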

class dicee.QueryGenerator(train_path, val_path: str, test_path: str, ent2id: Dict = None, rel2id: Dict = None, seed: int = 1, gen_valid: bool = False, gen_test: bool = True)[source]
train_path
val_path
test_path
gen_valid = False
gen_test = True
seed = 1
max_ans_num = 1000000.0
mode
ent2id = None
rel2id: Dict = None
ent_in: Dict
ent_out: Dict
query_name_to_struct
list2tuple(list_data)[source]
tuple2list(x: List | Tuple) List | Tuple[source]

Convert a nested tuple to a nested list.

set_global_seed(seed: int)[source]

Set seed

construct_graph(paths: List[str]) Tuple[Dict, Dict][source]

Construct a graph from triples. Returns dicts with incoming and outgoing edges.

fill_query(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, answer: int) bool[source]

Private method for fill_query logic.

achieve_answer(query: List[str | List], ent_in: Dict, ent_out: Dict) set[source]

Private method for achieve_answer logic. @TODO: Document the code

ground_queries(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, small_ent_in: Dict, small_ent_out: Dict, gen_num: int, query_name: str)[source]

Generate queries and obtain their answers.

unmap(query_type, queries, tp_answers, fp_answers, fn_answers)[source]
unmap_query(query_structure, query, id2ent, id2rel)[source]
generate_queries(query_struct: List, gen_num: int, query_type: str)[source]

Pass incoming and outgoing edges to ground queries, depending on the mode (train, valid, or test), and get queries and answers in return. @TODO: create a class for each single query struct

save_queries(query_type: str, gen_num: int, save_path: str)[source]
abstractmethod load_queries(path)[source]
get_queries(query_type: str, gen_num: int)[source]
static save_queries_and_answers(path: str, data: List[Tuple[str, Tuple[collections.defaultdict]]]) None[source]

Save queries to disk.

static load_queries_and_answers(path: str) List[Tuple[str, Tuple[collections.defaultdict]]][source]

Load queries from disk into memory.
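
A hedged sketch of query generation, assuming train/valid/test triple files and pre-built ent2id/rel2id index dictionaries; the paths and query type are illustrative.

>>> from dicee import QueryGenerator
>>> qg = QueryGenerator(train_path='KGs/UMLS/train.txt', val_path='KGs/UMLS/valid.txt',
...                     test_path='KGs/UMLS/test.txt', ent2id=ent2id, rel2id=rel2id)
>>> result = qg.get_queries(query_type='2p', gen_num=100)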

class dicee.DICE_Trainer(args, is_continual_training: bool, storage_path, evaluator=None)[source]
DICE_Trainer implements three training backends:

  1. PyTorch Lightning trainer (https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html)

  2. Multi-GPU DDP trainer (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)

  3. CPU trainer

args

is_continual_training: bool

storage_path: str

evaluator

report: dict

report
args
trainer = None
is_continual_training
storage_path
evaluator = None
form_of_labelling = None
continual_start(knowledge_graph)[source]
  1. Initialize training.

  2. Load the model.

  3. Load the trainer.

  4. Fit the model.

Returns:
  • model

  • form_of_labelling (str)

initialize_trainer(callbacks: List) lightning.Trainer | dicee.trainer.model_parallelism.TensorParallel | dicee.trainer.torch_trainer.TorchTrainer | dicee.trainer.torch_trainer_ddp.TorchDDPTrainer[source]

Initialize Trainer from input arguments

initialize_or_load_model()[source]
init_dataloader(dataset: torch.utils.data.Dataset) torch.utils.data.DataLoader[source]
init_dataset() torch.utils.data.Dataset[source]
start(knowledge_graph: dicee.knowledge_graph.KG | numpy.memmap) Tuple[dicee.models.base_model.BaseKGE, str][source]

Start the training.

  1. Initialize the trainer.

  2. Initialize or load a pretrained KGE model.

In a DDP setup, the memory map of the already read and indexed KG must be loaded.
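
A minimal sketch, assuming args is the same processed argument object used by Execute, kg is a loaded dicee.knowledge_graph.KG instance, and the storage path is hypothetical.

>>> from dicee import DICE_Trainer
>>> trainer = DICE_Trainer(args, is_continual_training=False, storage_path='Experiments/run')
>>> model, form_of_labelling = trainer.start(knowledge_graph=kg)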

k_fold_cross_validation(dataset) Tuple[dicee.models.base_model.BaseKGE, str][source]

Perform K-fold Cross-Validation

  1. Obtain K train and test splits.

  2. For each split:

    2.1. Initialize the trainer and model.

    2.2. Train the model with the configuration provided in args.

    2.3. Compute the mean reciprocal rank (MRR) of the model on the respective test split.

  3. Report the average MRR across splits.

Parameters:
  • dataset

Returns:

model

class dicee.Evaluator(args, is_continual_training: bool = False)[source]

Evaluator class for KGE models in various downstream tasks.

Orchestrates link prediction evaluation with different scoring techniques including standard evaluation and byte-pair encoding based evaluation.

er_vocab

Entity-relation to tail vocabulary for filtered ranking.

re_vocab

Relation-entity (tail) to head vocabulary.

ee_vocab

Entity-entity to relation vocabulary.

num_entities

Total number of entities in the knowledge graph.

num_relations

Total number of relations in the knowledge graph.

args

Configuration arguments.

report

Dictionary storing evaluation results.

during_training

Whether evaluation is happening during training.

Example

>>> from dicee.evaluation import Evaluator
>>> evaluator = Evaluator(args)
>>> results = evaluator.eval(dataset, model, 'EntityPrediction')
>>> print(f"Test MRR: {results['Test']['MRR']:.4f}")
re_vocab: Dict | None = None
er_vocab: Dict | None = None
ee_vocab: Dict | None = None
func_triple_to_bpe_representation = None
is_continual_training = False
num_entities: int | None = None
num_relations: int | None = None
domain_constraints_per_rel = None
range_constraints_per_rel = None
args
report: Dict
during_training = False
vocab_preparation(dataset) None[source]

Prepare vocabularies from the dataset for evaluation.

Resolves any future objects and saves vocabularies to disk.

Parameters:

dataset – Knowledge graph dataset with vocabulary attributes.

eval(dataset, trained_model, form_of_labelling: str, during_training: bool = False) Dict | None[source]

Evaluate the trained model on the dataset.

Parameters:
  • dataset – Knowledge graph dataset (KG instance).

  • trained_model – The trained KGE model.

  • form_of_labelling – Type of labelling (‘EntityPrediction’ or ‘RelationPrediction’).

  • during_training – Whether evaluation is during training.

Returns:

Dictionary of evaluation metrics, or None if evaluation is skipped.

eval_rank_of_head_and_tail_entity(*, train_set, valid_set=None, test_set=None, trained_model) None[source]

Evaluate with negative sampling scoring.

eval_rank_of_head_and_tail_byte_pair_encoded_entity(*, train_set=None, valid_set=None, test_set=None, ordered_bpe_entities, trained_model) None[source]

Evaluate with BPE-encoded entities and negative sampling.

eval_with_byte(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None[source]

Evaluate BytE model with generation.

eval_with_bpe_vs_all(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None[source]

Evaluate with BPE and KvsAll scoring.

eval_with_vs_all(*, train_set, valid_set=None, test_set=None, trained_model, form_of_labelling) None[source]

Evaluate with KvsAll or 1vsAll scoring.

evaluate_lp_k_vs_all(model, triple_idx, info: str | None = None, form_of_labelling: str | None = None) Dict[str, float][source]

Filtered link prediction evaluation with KvsAll scoring.

Parameters:
  • model – The trained model to evaluate.

  • triple_idx – Integer-indexed test triples.

  • info – Description to print.

  • form_of_labelling – ‘EntityPrediction’ or ‘RelationPrediction’.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

evaluate_lp_with_byte(model, triples: List[List[str]], info: str | None = None) Dict[str, float][source]

Evaluate BytE model with text generation.

Parameters:
  • model – BytE model.

  • triples – String triples.

  • info – Description to print.

Returns:

Dictionary with placeholder metrics (-1 values).

evaluate_lp_bpe_k_vs_all(model, triples: List[List[str]], info: str | None = None, form_of_labelling: str | None = None) Dict[str, float][source]

Evaluate BPE model with KvsAll scoring.

Parameters:
  • model – BPE-enabled model.

  • triples – String triples.

  • info – Description to print.

  • form_of_labelling – Type of labelling.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

evaluate_lp(model, triple_idx, info: str) Dict[str, float][source]

Evaluate link prediction with negative sampling.

Parameters:
  • model – The model to evaluate.

  • triple_idx – Integer-indexed triples.

  • info – Description to print.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

dummy_eval(trained_model, form_of_labelling: str) None[source]

Run evaluation from saved data (for continual training).

Parameters:
  • trained_model – The trained model.

  • form_of_labelling – Type of labelling.

eval_with_data(dataset, trained_model, triple_idx: numpy.ndarray, form_of_labelling: str) Dict[str, float][source]

Evaluate a trained model on a given dataset.

Parameters:
  • dataset – Knowledge graph dataset.

  • trained_model – The trained model.

  • triple_idx – Integer-indexed triples to evaluate.

  • form_of_labelling – Type of labelling.

Returns:

Dictionary with evaluation metrics.

Raises:

ValueError – If scoring technique is invalid.

dicee.__version__ = '0.3.2'