dicee
DICE Embeddings - Knowledge Graph Embedding Library.
A library for training and using knowledge graph embedding models with support for various scoring techniques and training strategies.
- Submodules:
  - evaluation: Model evaluation functions and Evaluator class
  - models: KGE model implementations
  - trainer: Training orchestration
  - scripts: Utility scripts
Submodules
- dicee.__main__
- dicee.abstracts
- dicee.analyse_experiments
- dicee.callbacks
- dicee.config
- dicee.dataset_classes
- dicee.eval_static_funcs
- dicee.evaluation
- dicee.evaluator
- dicee.executer
- dicee.knowledge_graph
- dicee.knowledge_graph_embeddings
- dicee.models
- dicee.query_generator
- dicee.read_preprocess_save_load_kg
- dicee.sanity_checkers
- dicee.scripts
- dicee.static_funcs
- dicee.static_funcs_training
- dicee.static_preprocess_funcs
- dicee.trainer
- dicee.weight_averaging
Attributes
Classes
- Execute: Executor class for training, retraining and evaluating KGE models.
- KGE: Knowledge Graph Embedding Class for interactive usage of pre-trained models.
- DICE_Trainer: Trainer implementing PyTorch Lightning, multi-GPU (DDP) and CPU training.
- Evaluator: Evaluator class for KGE models in various downstream tasks.
Package Contents
- class dicee.Execute(args, continuous_training: bool = False)[source]
Executor class for training, retraining and evaluating KGE models.
Handles the complete workflow:
1. Loading, preprocessing and serializing input data
2. Training, validation and testing
3. Storing all necessary information
- args
Processed input arguments.
- distributed
Whether distributed training is enabled.
- rank
Process rank in distributed training.
- world_size
Total number of processes.
- local_rank
Local GPU rank.
- trainer
Training handler instance.
- trained_model
The trained model after training completes.
- knowledge_graph
The loaded knowledge graph.
- report
Dictionary storing training metrics and results.
- evaluator
Model evaluation handler.
- distributed
- rank
- world_size
- local_rank
- args
- is_continual_training = False
- trainer: dicee.trainer.DICE_Trainer | None = None
- trained_model = None
- knowledge_graph: dicee.knowledge_graph.KG | None = None
- report: Dict
- evaluator: dicee.evaluator.Evaluator | None = None
- start_time: float | None = None
- setup_executor() None[source]
Set up storage directories for the experiment.
Creates or reuses experiment directories based on configuration. Saves the configuration to a JSON file.
- create_and_store_kg() None[source]
Create knowledge graph and store as memory-mapped file.
Only executed on local rank 0 in distributed training. Skips if memmap already exists.
- save_trained_model() None[source]
Save the trained knowledge graph embedding model.
(1) Move the model to eval mode and to the CPU.
(2) Store the memory footprint of the model.
(3) Save the model to disk.
(4) Update the stats of the KG.
- rtype:
None
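The distributed attributes (distributed, rank, world_size, local_rank) follow the standard torch.distributed launcher convention, where torchrun exports RANK, WORLD_SIZE and LOCAL_RANK for every process. A minimal sketch of how such fields are typically populated (illustrative only; the helper name is invented and this is not dicee's actual implementation):

```python
import os

def read_distributed_env() -> dict:
    """Read the standard torch.distributed launcher variables.

    torchrun exports RANK, WORLD_SIZE and LOCAL_RANK for every process;
    a single-process run falls back to the defaults below.
    """
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return {
        "distributed": world_size > 1,                          # more than one process
        "rank": int(os.environ.get("RANK", "0")),               # global process rank
        "world_size": world_size,                               # total process count
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),   # GPU index on this node
    }

if __name__ == "__main__":
    print(read_distributed_env())
```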
- class dicee.KGE(path=None, url=None, construct_ensemble=False, model_name=None)[source]
Bases:
dicee.abstracts.BaseInteractiveKGE, dicee.abstracts.InteractiveQueryDecomposition, dicee.abstracts.BaseInteractiveTrainKGE
Knowledge Graph Embedding Class for interactive usage of pre-trained models.
- get_transductive_entity_embeddings(indices: torch.LongTensor | List[str], as_pytorch=False, as_numpy=False, as_list=True) torch.FloatTensor | numpy.ndarray | List[float][source]
- create_vector_database(collection_name: str, distance: str, location: str = 'localhost', port: int = 6333)[source]
- predict_missing_head_entity(relation: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple[source]
Given a relation and a tail entity, return the top-k ranked head entities.
argmax_{e in E } f(e,r,t), where r in R, t in E.
Parameters
relation: Union[List[str], str]
String representation of selected relations.
tail_entity: Union[List[str], str]
String representation of selected entities.
topk: int
Number of highest-ranked head entities to return.
Returns: Tuple
Top-k scores and head entities
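The argmax above can be sketched as scoring every candidate head against the fixed (r, t) pair and keeping the k best. The entity set, toy scoring function and helper name below are invented for illustration; this is not dicee's API:

```python
from typing import Callable, List, Tuple

def topk_heads(entities: List[str], r: str, t: str,
               f: Callable[[str, str, str], float], k: int = 1) -> List[Tuple[str, float]]:
    """Return the k entities e maximising f(e, r, t)."""
    scored = [(e, f(e, r, t)) for e in entities]          # score every candidate head
    scored.sort(key=lambda pair: pair[1], reverse=True)   # highest score first
    return scored[:k]

# Toy scoring function over a hand-made knowledge graph.
known = {("Mongolia", "isLocatedIn", "Asia"), ("China", "isLocatedIn", "Asia")}
f = lambda h, r, t: 1.0 if (h, r, t) in known else 0.0

print(topk_heads(["Mongolia", "China", "France"], "isLocatedIn", "Asia", f, k=2))
# [('Mongolia', 1.0), ('China', 1.0)]
```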
- predict_missing_relations(head_entity: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple[source]
Given a head entity and a tail entity, return the top-k ranked relations.
argmax_{r in R } f(h,r,t), where h, t in E.
Parameters
head_entity: Union[List[str], str]
String representation of selected entities.
tail_entity: Union[List[str], str]
String representation of selected entities.
topk: int
Number of highest-ranked relations to return.
Returns: Tuple
Top-k scores and relations
- predict_missing_tail_entity(head_entity: List[str] | str, relation: List[str] | str, within: List[str] = None, batch_size=2, topk=1, return_indices=False) torch.FloatTensor[source]
Given a head entity and a relation, return the top-k ranked tail entities.
argmax_{e in E } f(h,r,e), where h in E and r in R.
Parameters
head_entity: Union[List[str], str]
String representation of selected entities.
relation: Union[List[str], str]
String representation of selected relations.
Returns: torch.FloatTensor
Scores of the top-k tail entities
- predict(*, h: List[str] | str | None = None, r: List[str] | str | None = None, t: List[str] | str | None = None, within: List[str] | None = None, logits: bool = True) torch.FloatTensor[source]
Predict scores for triples or missing triple elements.
- Parameters:
h – Head entity/entities. None to predict heads.
r – Relation/relations. None to predict relations.
t – Tail entity/entities. None to predict tails.
within – Optional list of entities to restrict predictions to.
logits – If True, return raw scores. If False, return sigmoid scores (0-1).
- Returns:
Single triple (h, r, t): scalar score
Missing element: vector of all possible scores
- Return type:
torch.FloatTensor of scores. Shape depends on the query type
- Raises:
AssertionError – If inputs are not strings or lists of strings.
Examples
>>> # Score a specific triple
>>> model.predict(h="Mongolia", r="isLocatedIn", t="Asia", logits=False)
tensor(0.9523)
>>> # Get scores for all possible tail entities
>>> model.predict(h="Mongolia", r="isLocatedIn", t=None)
tensor([0.21, 0.95, 0.03, ...])  # One score per entity
- predict_topk(*, h: str | List[str] | None = None, r: str | List[str] | None = None, t: str | List[str] | None = None, topk: int = 10, within: List[str] | None = None, batch_size: int = 1024) List[Tuple[str, float]] | List[List[Tuple[str, float]]][source]
Predict top-k missing items in a given triple pattern.
- Parameters:
h – Head entity/entities. None to predict heads.
r – Relation/relations. None to predict relations.
t – Tail entity/entities. None to predict tails.
topk – Number of top predictions to return.
within – Optional list of entities to restrict predictions to.
batch_size – Batch size for processing multiple queries.
- Returns:
List[(item, score), …] of length topk. For batch query: List of such lists, one per query.
- Return type:
List[Tuple[str, float]] for a single query; List[List[Tuple[str, float]]] for a batch of queries.
- Raises:
AssertionError – If more than one of h, r, t is None.
AssertionError – If the required arguments for a query type are None.
Examples
>>> model.predict_topk(h=["Mongolia"], r=["isLocatedIn"], topk=3)
[('Asia', 0.99), ('Europe', 0.02), ...]
>>> model.predict_topk(r=["isLocatedIn"], t=["Asia"], topk=5)
[('Mongolia', 0.85), ('China', 0.82), ...]
- triple_score(h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, logits=False) torch.FloatTensor[source]
Predict triple score(s).
Parameters
h: Union[List[str], str]
String representation of selected head entities.
r: Union[List[str], str]
String representation of selected relations.
t: Union[List[str], str]
String representation of selected tail entities.
logits: bool
If True, the unnormalized score is returned.
Returns: torch.FloatTensor
PyTorch tensor of triple score(s)
- single_hop_query_answering(query: tuple, only_scores: bool = True, k: int = None, use_logits: bool = True)[source]
- answer_multi_hop_query(query_type: str | None = None, query: Tuple[str | Tuple[str, str], Ellipsis] | None = None, queries: List[Tuple[str | Tuple[str, str], Ellipsis]] | None = None, tnorm: str = 'prod', neg_norm: str = 'standard', lambda_: float = 0.0, k: int = 10, only_scores: bool = False, use_logits: bool = True) List[Tuple[str, torch.Tensor]] | List[List[Tuple[str, torch.Tensor]]][source]
Answer multi-hop EPFO (Existential Positive First-Order) queries.
Supports 9 query types: 1p, 2p, 3p, 2i, 3i, ip, pi, 2u, up. See docs/guides/multi_hop_queries.md for detailed query patterns.
- Parameters:
query_type – Query pattern name. One of:
- 1p: (e, (r,))  # One-hop
- 2p: (e, (r1, r2))  # Two-hop
- 3p: (e, (r1, r2, r3))  # Three-hop
- 2i: ((e1, (r1,)), (e2, (r2,)))  # Two-way intersection
- 3i: ((e1, (r1,)), (e2, (r2,)), (e3, (r3,)))  # Three-way intersection
- ip: (((e1, (r1,)), (e2, (r2,))), (r3,))  # Intersection + projection
- pi: ((e, (r1, r2)), (r3,))  # Projection + intersection (2i meets 2p)
- 2u: ((e1, (r1,)), (e2, (r2,)))  # Two-way union
- up: ((e, (r1, r2)), (e, (r3,)))  # Union + projection
query – Single query tuple matching the query_type pattern.
queries – Batch of queries. If provided, query must be None.
tnorm – T-norm for intersection/union. Options: “prod”, “min”.
neg_norm – Negation norm. Options: “standard”, “sugeno”, “yager”.
lambda_ – Parameter for sugeno and yager negation (0.0-1.0).
k – Number of top answer entities to return.
only_scores – If True, return only scores tensor. If False, return (entity, score) tuples.
use_logits – If True, use raw model logits. If False, use sigmoid probabilities.
- Returns:
List[(entity, score), …] of top-k answers. For batch queries: List of such lists, one per query.
- Return type:
List[Tuple[str, torch.Tensor]] for a single query; List[List[Tuple[str, torch.Tensor]]] for a batch of queries.
- Raises:
ValueError – If query_type is not in {1p, 2p, 3p, 2i, 3i, ip, pi, 2u, up}.
AssertionError – If query structure doesn’t match query_type pattern.
Examples
>>> # 1p: Find entities located in Asia
>>> model.answer_multi_hop_query(
...     query_type="1p",
...     query=("Asia", ("isLocatedIn",)),
...     k=5
... )
[("Mongolia", 0.92), ("China", 0.89), ...]
>>> # 2p: Two-hop query (e.g., "capital of countries in Europe")
>>> model.answer_multi_hop_query(
...     query_type="2p",
...     query=("Europe", ("isLocatedIn", "hasCapital")),
...     k=3
... )
[("Paris", 0.85), ("Berlin", 0.82), ...]
>>> # 2i: Intersection query
>>> model.answer_multi_hop_query(
...     query_type="2i",
...     query=(("Asia", ("isLocatedIn",)), ("Mountains", ("hasGeography",))),
...     k=5
... )
[("Nepal", 0.78), ("Tibet", 0.65), ...]
See also
docs/guides/multi_hop_queries.md: Complete guide with all query patterns
tests/test_answer_multi_hop_query.py: Usage examples
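The tnorm option determines how per-hop scores are combined: under the product t-norm an intersection score is the product of the branch scores, under min it is their minimum, and unions use the corresponding t-conorms. A pure-Python sketch of this combination step (illustrative only, not dicee's internals):

```python
def tnorm(a: float, b: float, kind: str = "prod") -> float:
    """Conjunction (intersection) of two membership scores."""
    if kind == "prod":
        return a * b
    if kind == "min":
        return min(a, b)
    raise ValueError(kind)

def tconorm(a: float, b: float, kind: str = "prod") -> float:
    """Disjunction (union); the dual of the chosen t-norm."""
    if kind == "prod":
        return a + b - a * b   # probabilistic sum
    if kind == "min":
        return max(a, b)
    raise ValueError(kind)

# 2i-style intersection of two per-entity score vectors.
hop1 = {"Nepal": 0.9, "Tibet": 0.7}
hop2 = {"Nepal": 0.8, "Tibet": 0.9}
combined = {e: tnorm(hop1[e], hop2[e], "prod") for e in hop1}
print({e: round(s, 2) for e, s in combined.items()})  # {'Nepal': 0.72, 'Tibet': 0.63}
```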
- find_missing_triples(confidence: float, entities: List[str] = None, relations: List[str] = None, topk: int = 10, at_most: int = sys.maxsize) Set[source]
Find missing triples.
Iterate over a set of entities E and a set of relations R: for all e in E and for all r in R, compute f(e,r,x).
Return {(e,r,x) | f(e,r,x) > confidence and (e,r,x) not in G}.
Parameters
confidence: float
A threshold on the sigmoid output for a triple.
topk: int
Highest ranked k items used to select triples with f(e,r,x) > confidence.
at_most: int
Stop after finding at_most missing triples.
Returns: Set
Missing triples (e,r,x) not in G with f(e,r,x) > confidence.
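The iteration amounts to a brute-force scan: for every (e, r) pair, score each candidate x and keep high-confidence triples absent from the graph, stopping after at_most hits. A self-contained sketch with an invented toy scorer (illustrative only, not dicee's implementation):

```python
from itertools import islice

def find_missing(entities, relations, score, graph, confidence=0.5, at_most=10):
    """Return up to at_most triples (e, r, x) not in graph with score(e, r, x) > confidence."""
    def candidates():
        for e in entities:
            for r in relations:
                for x in entities:
                    triple = (e, r, x)
                    if triple not in graph and score(*triple) > confidence:
                        yield triple
    return set(islice(candidates(), at_most))   # stop after at_most hits

graph = {("a", "knows", "b")}
score = lambda e, r, x: 0.9 if (e, x) == ("b", "a") else 0.1
print(find_missing(["a", "b"], ["knows"], score, graph))
# {('b', 'knows', 'a')}
```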
- predict_literals(entity: List[str] | str = None, attribute: List[str] | str = None, denormalize_preds: bool = True) numpy.ndarray[source]
Predicts literal values for given entities and attributes.
- Parameters:
entity (Union[List[str], str]) – Entity or list of entities to predict literals for.
attribute (Union[List[str], str]) – Attribute or list of attributes to predict literals for.
denormalize_preds (bool) – If True, denormalizes the predictions.
- Returns:
Predictions for the given entities and attributes.
- Return type:
numpy ndarray
- class dicee.QueryGenerator(train_path, val_path: str, test_path: str, ent2id: Dict = None, rel2id: Dict = None, seed: int = 1, gen_valid: bool = False, gen_test: bool = True)[source]
- train_path
- val_path
- test_path
- gen_valid = False
- gen_test = True
- seed = 1
- max_ans_num = 1000000.0
- mode
- ent2id = None
- rel2id: Dict = None
- ent_in: Dict
- ent_out: Dict
- query_name_to_struct
- construct_graph(paths: List[str]) Tuple[Dict, Dict][source]
Construct a graph from triples. Returns dicts with incoming and outgoing edges.
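The incoming/outgoing dictionaries can be sketched as adjacency maps keyed by entity and relation. The helper below is an illustrative reimplementation of the idea, not dicee's code:

```python
from collections import defaultdict

def build_in_out(triples):
    """Map each entity to its incoming and outgoing neighbour sets per relation."""
    ent_in = defaultdict(lambda: defaultdict(set))   # ent_in[t][r] = heads pointing at t
    ent_out = defaultdict(lambda: defaultdict(set))  # ent_out[h][r] = tails reachable from h
    for h, r, t in triples:
        ent_out[h][r].add(t)
        ent_in[t][r].add(h)
    return ent_in, ent_out

triples = [("Mongolia", "isLocatedIn", "Asia"), ("China", "isLocatedIn", "Asia")]
ent_in, ent_out = build_in_out(triples)
print(sorted(ent_in["Asia"]["isLocatedIn"]))  # ['China', 'Mongolia']
```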
- fill_query(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, answer: int) bool[source]
Private method for fill_query logic.
- achieve_answer(query: List[str | List], ent_in: Dict, ent_out: Dict) set[source]
Private method for achieve_answer logic. @TODO: Document the code
- ground_queries(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, small_ent_in: Dict, small_ent_out: Dict, gen_num: int, query_name: str)[source]
Generate queries and compute their answers.
- generate_queries(query_struct: List, gen_num: int, query_type: str)[source]
Pass incoming and outgoing edges to ground_queries depending on the mode (train, valid or test) and get queries and answers in return. @TODO: create a class for each single query struct
- class dicee.DICE_Trainer(args, is_continual_training: bool, storage_path, evaluator=None)[source]
- DICE_Trainer implements:
1. PyTorch Lightning trainer (https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html)
2. Multi-GPU trainer (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
3. CPU trainer
args
is_continual_training: bool
storage_path: str
evaluator:
report: dict
- report
- args
- trainer = None
- is_continual_training
- storage_path
- evaluator = None
- form_of_labelling = None
- continual_start(knowledge_graph)[source]
Initialize continual training.
(1) Load model
(2) Load trainer
(3) Fit model
- returns:
model
form_of_labelling (str)
- initialize_trainer(callbacks: List) lightning.Trainer | dicee.trainer.model_parallelism.TensorParallel | dicee.trainer.torch_trainer.TorchTrainer | dicee.trainer.torch_trainer_ddp.TorchDDPTrainer[source]
Initialize Trainer from input arguments
- start(knowledge_graph: dicee.knowledge_graph.KG | numpy.memmap) Tuple[dicee.models.base_model.BaseKGE, str][source]
Start the training.
(1) Initialize the trainer.
(2) Initialize or load a pretrained KGE model.
In a DDP setup, load the memory map of the already read/indexed KG.
- k_fold_cross_validation(dataset) Tuple[dicee.models.base_model.BaseKGE, str][source]
Perform K-fold cross-validation.
1. Obtain K train and test splits.
2. For each split:
2.1. Initialize trainer and model.
2.2. Train the model with the configuration provided in args.
2.3. Compute the mean reciprocal rank (MRR) of the model on the respective test split.
3. Report the mean MRR across splits.
- Parameters:
self
dataset
- Returns:
model
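The split-and-aggregate pattern can be sketched in a few lines: partition the triple indices into K folds, evaluate on each held-out fold, and average the MRR. The fold helper and evaluation stub below are invented for illustration; they are not dicee's implementation:

```python
def k_fold_indices(n: int, k: int):
    """Split range(n) into k contiguous (train, test) index folds."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

def evaluate_stub(model, test_idx):
    """Placeholder for computing MRR on a test split."""
    return 0.5

# Train on each fold's train split, evaluate on its test split, report the mean.
mrrs = [evaluate_stub(None, test) for _, test in k_fold_indices(10, 5)]
mean_mrr = sum(mrrs) / len(mrrs)
print(mean_mrr)  # 0.5
```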
- class dicee.Evaluator(args, is_continual_training: bool = False)[source]
Evaluator class for KGE models in various downstream tasks.
Orchestrates link prediction evaluation with different scoring techniques including standard evaluation and byte-pair encoding based evaluation.
- er_vocab
Entity-relation to tail vocabulary for filtered ranking.
- re_vocab
Relation-entity (tail) to head vocabulary.
- ee_vocab
Entity-entity to relation vocabulary.
- num_entities
Total number of entities in the knowledge graph.
- num_relations
Total number of relations in the knowledge graph.
- args
Configuration arguments.
- report
Dictionary storing evaluation results.
- during_training
Whether evaluation is happening during training.
Example
>>> from dicee.evaluation import Evaluator
>>> evaluator = Evaluator(args)
>>> results = evaluator.eval(dataset, model, 'EntityPrediction')
>>> print(f"Test MRR: {results['Test']['MRR']:.4f}")
- re_vocab: Dict | None = None
- er_vocab: Dict | None = None
- ee_vocab: Dict | None = None
- func_triple_to_bpe_representation = None
- is_continual_training = False
- num_entities: int | None = None
- num_relations: int | None = None
- domain_constraints_per_rel = None
- range_constraints_per_rel = None
- args
- report: Dict
- during_training = False
- vocab_preparation(dataset) None[source]
Prepare vocabularies from the dataset for evaluation.
Resolves any future objects and saves vocabularies to disk.
- Parameters:
dataset – Knowledge graph dataset with vocabulary attributes.
- eval(dataset, trained_model, form_of_labelling: str, during_training: bool = False) Dict | None[source]
Evaluate the trained model on the dataset.
- Parameters:
dataset – Knowledge graph dataset (KG instance).
trained_model – The trained KGE model.
form_of_labelling – Type of labelling (‘EntityPrediction’ or ‘RelationPrediction’).
during_training – Whether evaluation is during training.
- Returns:
Dictionary of evaluation metrics, or None if evaluation is skipped.
- eval_rank_of_head_and_tail_entity(*, train_set, valid_set=None, test_set=None, trained_model) None[source]
Evaluate with negative sampling scoring.
- eval_rank_of_head_and_tail_byte_pair_encoded_entity(*, train_set=None, valid_set=None, test_set=None, ordered_bpe_entities, trained_model) None[source]
Evaluate with BPE-encoded entities and negative sampling.
- eval_with_byte(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None[source]
Evaluate BytE model with generation.
- eval_with_bpe_vs_all(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None[source]
Evaluate with BPE and KvsAll scoring.
- eval_with_vs_all(*, train_set, valid_set=None, test_set=None, trained_model, form_of_labelling) None[source]
Evaluate with KvsAll or 1vsAll scoring.
- evaluate_lp_k_vs_all(model, triple_idx, info: str | None = None, form_of_labelling: str | None = None) Dict[str, float][source]
Filtered link prediction evaluation with KvsAll scoring.
- Parameters:
model – The trained model to evaluate.
triple_idx – Integer-indexed test triples.
info – Description to print.
form_of_labelling – ‘EntityPrediction’ or ‘RelationPrediction’.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
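Filtered evaluation masks the other known answers from er_vocab before computing the rank of the true tail. A minimal pure-Python sketch of the metric computation (the function and its inputs are illustrative, not dicee's internals):

```python
def filtered_metrics(queries, er_vocab, scores):
    """Compute MRR and Hits@k under the filtered setting.

    queries:  list of (h, r, true_tail)
    er_vocab: {(h, r): set of all known tails}  -- used for filtering
    scores:   {(h, r): {tail: score}}           -- model scores per candidate
    """
    ranks = []
    for h, r, t in queries:
        s = dict(scores[(h, r)])
        for other in er_vocab[(h, r)] - {t}:   # mask other true answers
            s[other] = float("-inf")
        # rank = 1 + number of candidates scored strictly higher than t
        rank = 1 + sum(1 for v in s.values() if v > s[t])
        ranks.append(rank)
    n = len(ranks)
    return {
        "MRR": sum(1.0 / r for r in ranks) / n,
        "H@1": sum(r <= 1 for r in ranks) / n,
        "H@3": sum(r <= 3 for r in ranks) / n,
        "H@10": sum(r <= 10 for r in ranks) / n,
    }

# 'b' would rank second raw, but first once the other true tail 'a' is filtered.
er_vocab = {("h", "r"): {"a", "b"}}
scores = {("h", "r"): {"a": 0.9, "b": 0.8, "c": 0.1}}
print(filtered_metrics([("h", "r", "b")], er_vocab, scores))
```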
- evaluate_lp_with_byte(model, triples: List[List[str]], info: str | None = None) Dict[str, float][source]
Evaluate BytE model with text generation.
- Parameters:
model – BytE model.
triples – String triples.
info – Description to print.
- Returns:
Dictionary with placeholder metrics (-1 values).
- evaluate_lp_bpe_k_vs_all(model, triples: List[List[str]], info: str | None = None, form_of_labelling: str | None = None) Dict[str, float][source]
Evaluate BPE model with KvsAll scoring.
- Parameters:
model – BPE-enabled model.
triples – String triples.
info – Description to print.
form_of_labelling – Type of labelling.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- evaluate_lp(model, triple_idx, info: str) Dict[str, float][source]
Evaluate link prediction with negative sampling.
- Parameters:
model – The model to evaluate.
triple_idx – Integer-indexed triples.
info – Description to print.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- dummy_eval(trained_model, form_of_labelling: str) None[source]
Run evaluation from saved data (for continual training).
- Parameters:
trained_model – The trained model.
form_of_labelling – Type of labelling.
- eval_with_data(dataset, trained_model, triple_idx: numpy.ndarray, form_of_labelling: str) Dict[str, float][source]
Evaluate a trained model on a given dataset.
- Parameters:
dataset – Knowledge graph dataset.
trained_model – The trained model.
triple_idx – Integer-indexed triples to evaluate.
form_of_labelling – Type of labelling.
- Returns:
Dictionary with evaluation metrics.
- Raises:
ValueError – If scoring technique is invalid.
- dicee.__version__ = '0.3.3'