dicee.evaluation

Evaluation module for knowledge graph embedding models.

This module provides comprehensive evaluation capabilities for KGE models, including link prediction, literal prediction, and ensemble evaluation.

Modules:

link_prediction: Functions for evaluating link prediction performance
literal_prediction: Functions for evaluating literal/attribute prediction
ensemble: Functions for ensemble model evaluation
evaluator: Main Evaluator class for integrated evaluation
utils: Shared utility functions for evaluation

Example

>>> from dicee.evaluation import Evaluator
>>> from dicee.evaluation.link_prediction import evaluate_link_prediction_performance
>>> from dicee.evaluation.ensemble import evaluate_ensemble_link_prediction_performance

Submodules

dicee.evaluation.ensemble

dicee.evaluation.evaluator

dicee.evaluation.link_prediction

dicee.evaluation.literal_prediction

dicee.evaluation.utils

Classes

Evaluator

Evaluator class for KGE models in various downstream tasks.

Functions

evaluate_link_prediction_performance(→ Dict[str, float])

Evaluate link prediction performance with head and tail prediction.

evaluate_link_prediction_performance_with_reciprocals(...)

Evaluate link prediction with reciprocal relations.

evaluate_link_prediction_performance_with_bpe(...)

Evaluate link prediction with BPE encoding (head and tail).

evaluate_link_prediction_performance_with_bpe_reciprocals(...)

Evaluate link prediction with BPE encoding and reciprocals.

evaluate_lp(→ Dict[str, float])

Evaluate link prediction with batched processing.

evaluate_lp_bpe_k_vs_all(→ Dict[str, float])

Evaluate BPE link prediction with KvsAll scoring.

evaluate_bpe_lp(→ Dict[str, float])

Evaluate link prediction with BPE-encoded entities.

evaluate_literal_prediction(→ Optional[pandas.DataFrame])

Evaluate a trained literal prediction model on a test file.

evaluate_ensemble_link_prediction_performance(...)

Evaluate link prediction performance of an ensemble of KGE models.

compute_metrics_from_ranks(→ Dict[str, float])

Compute standard link prediction metrics from ranks.

make_iterable_verbose(→ Iterable)

Wrap an iterable with tqdm progress bar if verbose is True.

Package Contents

class dicee.evaluation.Evaluator(args, is_continual_training: bool = False)

Evaluator class for KGE models in various downstream tasks.

Orchestrates link prediction evaluation with different scoring techniques, including standard evaluation and byte-pair encoding (BPE) based evaluation.

er_vocab

Entity-relation to tail vocabulary for filtered ranking.

re_vocab

Relation-entity (tail) to head vocabulary.

ee_vocab

Entity-entity to relation vocabulary.

num_entities

Total number of entities in the knowledge graph.

num_relations

Total number of relations in the knowledge graph.

args

Configuration arguments.

report

Dictionary storing evaluation results.

during_training

Whether evaluation is happening during training.

Example

>>> from dicee.evaluation import Evaluator
>>> evaluator = Evaluator(args)
>>> results = evaluator.eval(dataset, model, 'EntityPrediction')
>>> print(f"Test MRR: {results['Test']['MRR']:.4f}")

re_vocab: Dict | None = None
er_vocab: Dict | None = None
ee_vocab: Dict | None = None
func_triple_to_bpe_representation = None
is_continual_training = False
num_entities: int | None = None
num_relations: int | None = None
domain_constraints_per_rel = None
range_constraints_per_rel = None
args
report: Dict
during_training = False
vocab_preparation(dataset) → None

Prepare vocabularies from the dataset for evaluation.

Resolves any future objects and saves vocabularies to disk.

Parameters:

dataset – Knowledge graph dataset with vocabulary attributes.
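
The vocabularies follow the usual filtered-ranking layout: er_vocab maps (head, relation) to the known tails, re_vocab maps (relation, tail) to the known heads, and ee_vocab maps (head, tail) to the known relations. A minimal sketch of how such mappings can be built from integer-indexed triples (illustrative only, not the exact dicee construction):

>>> from collections import defaultdict
>>> triples = [(0, 0, 1), (0, 0, 2), (3, 1, 0)]  # (head, relation, tail) indices
>>> er_vocab, re_vocab, ee_vocab = defaultdict(list), defaultdict(list), defaultdict(list)
>>> for h, r, t in triples:
...     er_vocab[(h, r)].append(t)  # (head, relation) -> valid tails
...     re_vocab[(r, t)].append(h)  # (relation, tail) -> valid heads
...     ee_vocab[(h, t)].append(r)  # (head, tail) -> valid relations
>>> er_vocab[(0, 0)]
[1, 2]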

eval(dataset, trained_model, form_of_labelling: str, during_training: bool = False) → Dict | None

Evaluate the trained model on the dataset.

Parameters:
  • dataset – Knowledge graph dataset (KG instance).

  • trained_model – The trained KGE model.

  • form_of_labelling – Type of labelling (‘EntityPrediction’ or ‘RelationPrediction’).

  • during_training – Whether evaluation is during training.

Returns:

Dictionary of evaluation metrics, or None if evaluation is skipped.

eval_rank_of_head_and_tail_entity(*, train_set, valid_set=None, test_set=None, trained_model) → None

Evaluate with negative sampling scoring.

eval_rank_of_head_and_tail_byte_pair_encoded_entity(*, train_set=None, valid_set=None, test_set=None, ordered_bpe_entities, trained_model) → None

Evaluate with BPE-encoded entities and negative sampling.

eval_with_byte(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) → None

Evaluate BytE model with generation.

eval_with_bpe_vs_all(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) → None

Evaluate with BPE and KvsAll scoring.

eval_with_vs_all(*, train_set, valid_set=None, test_set=None, trained_model, form_of_labelling) → None

Evaluate with KvsAll or 1vsAll scoring.

evaluate_lp_k_vs_all(model, triple_idx, info: str = None, form_of_labelling: str = None) → Dict[str, float]

Filtered link prediction evaluation with KvsAll scoring.

Parameters:
  • model – The trained model to evaluate.

  • triple_idx – Integer-indexed test triples.

  • info – Description to print.

  • form_of_labelling – ‘EntityPrediction’ or ‘RelationPrediction’.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

evaluate_lp_with_byte(model, triples: List[List[str]], info: str = None) → Dict[str, float]

Evaluate BytE model with text generation.

Parameters:
  • model – BytE model.

  • triples – String triples.

  • info – Description to print.

Returns:

Dictionary with placeholder metrics (-1 values).

evaluate_lp_bpe_k_vs_all(model, triples: List[List[str]], info: str = None, form_of_labelling: str = None) → Dict[str, float]

Evaluate BPE model with KvsAll scoring.

Parameters:
  • model – BPE-enabled model.

  • triples – String triples.

  • info – Description to print.

  • form_of_labelling – Type of labelling.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

evaluate_lp(model, triple_idx, info: str) → Dict[str, float]

Evaluate link prediction with negative sampling.

Parameters:
  • model – The model to evaluate.

  • triple_idx – Integer-indexed triples.

  • info – Description to print.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

dummy_eval(trained_model, form_of_labelling: str) → None

Run evaluation from saved data (for continual training).

Parameters:
  • trained_model – The trained model.

  • form_of_labelling – Type of labelling.

eval_with_data(dataset, trained_model, triple_idx: numpy.ndarray, form_of_labelling: str) → Dict[str, float]

Evaluate a trained model on a given dataset.

Parameters:
  • dataset – Knowledge graph dataset.

  • trained_model – The trained model.

  • triple_idx – Integer-indexed triples to evaluate.

  • form_of_labelling – Type of labelling.

Returns:

Dictionary with evaluation metrics.

Raises:

ValueError – If the scoring technique is invalid.

dicee.evaluation.evaluate_link_prediction_performance(model, triples, er_vocab, re_vocab) → Dict[str, float]

Evaluate link prediction performance with head and tail prediction.

Performs filtered evaluation where known correct answers are filtered out before computing ranks.

Parameters:
  • model – KGE model wrapper with entity/relation mappings.

  • triples – Test triples as list of (head, relation, tail) strings.

  • er_vocab – Mapping (entity, relation) -> list of valid tail entities.

  • re_vocab – Mapping (relation, entity) -> list of valid head entities.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.
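
For reference, filtered ranking of a single tail prediction works roughly as follows: the scores of all other known correct tails for (head, relation) are masked out before the rank of the target tail is computed. The snippet below is a minimal illustrative sketch of this filtering step, not the exact dicee implementation:

>>> import torch
>>> scores = torch.tensor([0.1, 0.9, 0.8, 0.3])  # model scores over all entities
>>> target, known_tails = 2, [1, 2]               # known_tails comes from er_vocab[(head, relation)]
>>> target_score = scores[target].clone()
>>> scores[known_tails] = -float('inf')           # filter out all known correct answers
>>> scores[target] = target_score                 # restore the target's own score
>>> rank = int((scores > target_score).sum()) + 1
>>> rank
1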

dicee.evaluation.evaluate_link_prediction_performance_with_reciprocals(model, triples, er_vocab) → Dict[str, float]

Evaluate link prediction with reciprocal relations.

Optimized for models trained with reciprocal triples where only tail prediction is needed.

Parameters:
  • model – KGE model wrapper.

  • triples – Test triples as list of (head, relation, tail) strings.

  • er_vocab – Mapping (entity, relation) -> list of valid tail entities.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.
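
With reciprocal relations, head prediction for (head, relation, tail) is recast as tail prediction for (tail, relation_inverse, head), so a single tail-ranking pass covers both directions. A hedged sketch of the triple augmentation (the "_inverse" suffix is purely illustrative; the actual naming is whatever the training pipeline uses):

>>> triples = [("Berlin", "locatedIn", "Germany")]
>>> reciprocal = [(t, r + "_inverse", h) for h, r, t in triples]
>>> reciprocal
[('Germany', 'locatedIn_inverse', 'Berlin')]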

dicee.evaluation.evaluate_link_prediction_performance_with_bpe(model, within_entities, triples, er_vocab, re_vocab) → Dict[str, float]

Evaluate link prediction with BPE encoding (head and tail).

Parameters:
  • model – KGE model wrapper with BPE support.

  • within_entities – List of entities to evaluate within.

  • triples – Test triples as list of (head, relation, tail) tuples.

  • er_vocab – Mapping (entity, relation) -> list of valid tail entities.

  • re_vocab – Mapping (relation, entity) -> list of valid head entities.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

dicee.evaluation.evaluate_link_prediction_performance_with_bpe_reciprocals(model, within_entities, triples, er_vocab) → Dict[str, float]

Evaluate link prediction with BPE encoding and reciprocals.

Parameters:
  • model – KGE model wrapper with BPE support.

  • within_entities – List of entities to evaluate within.

  • triples – Test triples as list of [head, relation, tail] strings.

  • er_vocab – Mapping (entity, relation) -> list of valid tail entities.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

dicee.evaluation.evaluate_lp(model, triple_idx, num_entities: int, er_vocab: Dict[Tuple, List], re_vocab: Dict[Tuple, List], info: str = 'Eval Starts', batch_size: int = 128, chunk_size: int = 1000) → Dict[str, float]

Evaluate link prediction with batched processing.

Memory-efficient evaluation using chunked entity scoring.

Parameters:
  • model – The KGE model to evaluate.

  • triple_idx – Integer-indexed triples as numpy array.

  • num_entities – Total number of entities.

  • er_vocab – Mapping (head_idx, rel_idx) -> list of tail indices.

  • re_vocab – Mapping (rel_idx, tail_idx) -> list of head indices.

  • info – Description to print.

  • batch_size – Batch size for triple processing.

  • chunk_size – Chunk size for entity scoring.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.
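
A hedged usage sketch, assuming a trained model, integer-indexed test triples, and er_vocab/re_vocab built as shown for vocab_preparation above (model, test_triples, num_entities, er_vocab and re_vocab are placeholders you supply):

>>> import numpy as np
>>> from dicee.evaluation import evaluate_lp
>>> results = evaluate_lp(model, np.asarray(test_triples),
...                       num_entities=num_entities,
...                       er_vocab=er_vocab, re_vocab=re_vocab,
...                       info='Test set', batch_size=256, chunk_size=2000)
>>> print(f"MRR: {results['MRR']:.4f}")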

dicee.evaluation.evaluate_lp_bpe_k_vs_all(model, triples: List[List[str]], er_vocab: Dict = None, batch_size: int = None, func_triple_to_bpe_representation: Callable = None, str_to_bpe_entity_to_idx: Dict = None) → Dict[str, float]

Evaluate BPE link prediction with KvsAll scoring.

Parameters:
  • model – The KGE model wrapper.

  • triples – List of string triples.

  • er_vocab – Entity-relation vocabulary for filtering.

  • batch_size – Batch size for processing.

  • func_triple_to_bpe_representation – Function to convert triples to BPE.

  • str_to_bpe_entity_to_idx – Mapping from string entities to BPE indices.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

dicee.evaluation.evaluate_bpe_lp(model, triple_idx: List[Tuple], all_bpe_shaped_entities, er_vocab: Dict[Tuple, List], re_vocab: Dict[Tuple, List], info: str = 'Eval Starts') → Dict[str, float]

Evaluate link prediction with BPE-encoded entities.

Parameters:
  • model – The KGE model to evaluate.

  • triple_idx – List of BPE-encoded triple tuples.

  • all_bpe_shaped_entities – All entities with BPE representations.

  • er_vocab – Mapping for tail filtering.

  • re_vocab – Mapping for head filtering.

  • info – Description to print.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

dicee.evaluation.evaluate_literal_prediction(kge_model, eval_file_path: str = None, store_lit_preds: bool = True, eval_literals: bool = True, loader_backend: str = 'pandas', return_attr_error_metrics: bool = False) → pandas.DataFrame | None

Evaluate a trained literal prediction model on a test file.

Evaluates the literal prediction capabilities of a KGE model by computing mean absolute error (MAE) and root mean squared error (RMSE) for each attribute.

Parameters:
  • kge_model – Trained KGE model with literal prediction capability.

  • eval_file_path – Path to the evaluation file containing test literals.

  • store_lit_preds – If True, stores predictions to CSV file.

  • eval_literals – If True, evaluates and prints error metrics.

  • loader_backend – Backend for loading dataset (‘pandas’ or ‘rdflib’).

  • return_attr_error_metrics – If True, returns the metrics DataFrame.

Returns:

DataFrame with per-attribute MAE and RMSE if return_attr_error_metrics is True, otherwise None.

Raises:
  • RuntimeError – If the KGE model doesn’t have a trained literal model.

  • AssertionError – If model is invalid or test set has no valid data.

Example

>>> from dicee import KGE
>>> from dicee.evaluation import evaluate_literal_prediction
>>> model = KGE(path="pretrained_model")
>>> metrics = evaluate_literal_prediction(
...     model,
...     eval_file_path="test_literals.csv",
...     return_attr_error_metrics=True
... )
>>> print(metrics)

dicee.evaluation.evaluate_ensemble_link_prediction_performance(models, triples, er_vocab, weights, batch_size, weighted_averaging, normalize_scores) → Dict[str, float]

Evaluate link prediction performance of an ensemble of KGE models.

Combines predictions from multiple models using weighted or simple averaging, with optional score normalization.

Parameters:
  • models – List of KGE models (e.g., snapshots from training).

  • triples – Test triples as numpy array or list, shape (N, 3), with integer indices (head, relation, tail).

  • er_vocab – Mapping (head_idx, rel_idx) -> list of tail indices for filtered evaluation.

  • weights – Weights for model averaging. Required if weighted_averaging is True. Must sum to 1 for proper averaging.

  • batch_size – Batch size for processing triples.

  • weighted_averaging – If True, use weighted averaging of predictions. If False, use simple mean.

  • normalize_scores – If True, normalize scores to [0, 1] range per sample before averaging.

Returns:

Dictionary with H@1, H@3, H@10, and MRR metrics.

Raises:

AssertionError – If weighted_averaging is True but weights are not provided or have wrong length.

Example

>>> from dicee.evaluation import evaluate_ensemble_link_prediction_performance
>>> models = [model1, model2, model3]
>>> weights = [0.5, 0.3, 0.2]
>>> results = evaluate_ensemble_link_prediction_performance(
...     models, test_triples, er_vocab,
...     weights=weights, weighted_averaging=True
... )
>>> print(f"MRR: {results['MRR']:.4f}")
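
The combination scheme described above amounts to optional per-sample normalization followed by a (weighted) average of the models' score matrices. A minimal illustrative sketch of that arithmetic (random scores stand in for real model outputs):

>>> import torch
>>> weights = [0.5, 0.3, 0.2]
>>> per_model_scores = [torch.rand(4, 10) for _ in weights]  # (batch, num_entities) per model
>>> def minmax(s):
...     lo = s.min(dim=1, keepdim=True).values
...     hi = s.max(dim=1, keepdim=True).values
...     return (s - lo) / (hi - lo + 1e-12)  # normalize each sample's scores to [0, 1]
>>> combined = sum(w * minmax(s) for w, s in zip(weights, per_model_scores))
>>> combined.shape
torch.Size([4, 10])
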
dicee.evaluation.compute_metrics_from_ranks(ranks: List[int], num_triples: int, hits_dict: Dict[int, List[float]], scale_factor: int = 1) → Dict[str, float]

Compute standard link prediction metrics from ranks.

Parameters:
  • ranks – List of ranks for each prediction.

  • num_triples – Total number of triples evaluated.

  • hits_dict – Dictionary mapping hit levels to lists of hits.

  • scale_factor – Factor to scale the denominator (e.g., 2 for head+tail).

Returns:

Dictionary containing H@1, H@3, H@10, and MRR metrics.
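
The metrics follow the standard definitions: MRR is the mean of reciprocal ranks and H@k is the fraction of ranks that are at most k. A minimal sketch of that computation on a hand-made rank list (illustrative only; the real return value uses the metric names above as keys):

>>> ranks = [1, 3, 12, 2]
>>> mrr = sum(1.0 / r for r in ranks) / len(ranks)
>>> hits_at_10 = sum(r <= 10 for r in ranks) / len(ranks)
>>> round(mrr, 3), hits_at_10
(0.479, 0.75)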

dicee.evaluation.make_iterable_verbose(iterable_object: Iterable, verbose: bool, desc: str = 'Default', position: int = None, leave: bool = True) → Iterable

Wrap an iterable with tqdm progress bar if verbose is True.

Parameters:
  • iterable_object – The iterable to potentially wrap.

  • verbose – Whether to show progress bar.

  • desc – Description for the progress bar.

  • position – Position of the progress bar.

  • leave – Whether to leave the progress bar after completion.

Returns:

The original iterable or a tqdm-wrapped version.
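
A hedged usage sketch (the loop body is a placeholder for whatever per-batch work the caller does):

>>> from dicee.evaluation import make_iterable_verbose
>>> for batch in make_iterable_verbose(range(3), verbose=True, desc='Evaluating'):
...     pass  # score the batch here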