dicee
DICE Embeddings - Knowledge Graph Embedding Library.
A library for training and using knowledge graph embedding models with support for various scoring techniques and training strategies.
- Submodules:
evaluation: Model evaluation functions and the Evaluator class
models: KGE model implementations
trainer: Training orchestration
scripts: Utility scripts
Submodules
- dicee.__main__
- dicee.abstracts
- dicee.analyse_experiments
- dicee.callbacks
- dicee.config
- dicee.dataset_classes
- dicee.eval_static_funcs
- dicee.evaluation
- dicee.evaluator
- dicee.executer
- dicee.knowledge_graph
- dicee.knowledge_graph_embeddings
- dicee.models
- dicee.query_generator
- dicee.read_preprocess_save_load_kg
- dicee.sanity_checkers
- dicee.scripts
- dicee.static_funcs
- dicee.static_funcs_training
- dicee.static_preprocess_funcs
- dicee.trainer
- dicee.weight_averaging
Attributes
- __version__
Classes
- Execute: Executor class for training, retraining and evaluating KGE models.
- KGE: Knowledge Graph Embedding class for interactive usage of pre-trained models.
- DICE_Trainer: Training orchestration via a PyTorch Lightning trainer, a multi-GPU (DDP) trainer, or a CPU trainer.
- Evaluator: Evaluator class for KGE models in various downstream tasks.
Package Contents
- class dicee.Execute(args, continuous_training: bool = False)[source]
Executor class for training, retraining and evaluating KGE models.
Handles the complete workflow:
1. Loading & preprocessing & serializing the input data
2. Training & validation & testing
3. Storing all necessary information
- args
Processed input arguments.
- distributed
Whether distributed training is enabled.
- rank
Process rank in distributed training.
- world_size
Total number of processes.
- local_rank
Local GPU rank.
- trainer
Training handler instance.
- trained_model
The trained model after training completes.
- knowledge_graph
The loaded knowledge graph.
- report
Dictionary storing training metrics and results.
- evaluator
Model evaluation handler.
- distributed
- args
- is_continual_training = False
- trainer: dicee.trainer.DICE_Trainer | None = None
- trained_model = None
- knowledge_graph: dicee.knowledge_graph.KG | None = None
- report: Dict
- evaluator: dicee.evaluator.Evaluator | None = None
- start_time: float | None = None
- setup_executor() None[source]
Set up storage directories for the experiment.
Creates or reuses experiment directories based on configuration. Saves the configuration to a JSON file.
- create_and_store_kg() None[source]
Create knowledge graph and store as memory-mapped file.
Only executed on rank 0 in distributed training. Skips if memmap already exists.
- save_trained_model() None[source]
Save a trained knowledge graph embedding model.
1. Send the model to eval mode and to the CPU.
2. Store the memory footprint of the model.
3. Save the model to disk.
4. Update the statistics of the KG.
- rtype:
None
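A minimal end-to-end training sketch, not taken from this page: it assumes dicee.config provides a Namespace argument container with sensible defaults and that Execute exposes a start() entry point as in the project README; the model name and dataset directory are placeholders.
>>> from dicee import Execute
>>> from dicee.config import Namespace
>>> args = Namespace()
>>> args.model = 'Keci'                 # hypothetical model choice
>>> args.dataset_dir = 'KGs/UMLS'       # hypothetical dataset directory
>>> args.num_epochs = 10
>>> report = Execute(args).start()      # train, evaluate, and store results
>>> sorted(report.keys())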
- class dicee.KGE(path=None, url=None, construct_ensemble=False, model_name=None)[source]
Bases:
dicee.abstracts.BaseInteractiveKGE, dicee.abstracts.InteractiveQueryDecomposition, dicee.abstracts.BaseInteractiveTrainKGE
Knowledge Graph Embedding class for interactive usage of pre-trained models.
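A minimal loading sketch; the experiment directory is a placeholder, and the method examples below reuse a model loaded this way.
>>> from dicee import KGE
>>> # Load a pre-trained model from a (hypothetical) experiment directory
>>> # produced by Execute; a url can be passed instead of a local path.
>>> model = KGE(path='Experiments/2024-01-01_12-00-00')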
- get_transductive_entity_embeddings(indices: torch.LongTensor | List[str], as_pytorch=False, as_numpy=False, as_list=True) torch.FloatTensor | numpy.ndarray | List[float][source]
- create_vector_database(collection_name: str, distance: str, location: str = 'localhost', port: int = 6333)[source]
- predict_missing_head_entity(relation: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple[source]
Given a relation and a tail entity, return the top-k ranked head entities.
argmax_{e in E} f(e,r,t), where r in R and t in E.
Parameters
relation: Union[List[str], str]
String representation of selected relations.
tail_entity: Union[List[str], str]
String representation of selected entities.
topk: int
Number of highest-ranked head entities to return.
Returns: Tuple
Top-k scores and the corresponding head entities
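A usage sketch, assuming the placeholder experiment directory, relation, and entity labels exist in the trained model's vocabulary.
>>> from dicee import KGE
>>> model = KGE(path='Experiments/2024-01-01_12-00-00')  # hypothetical path
>>> # Top-3 head entities e maximizing f(e, r, t) for a placeholder (r, t) pair.
>>> model.predict_missing_head_entity(relation=['locatedIn'], tail_entity=['Europe'], topk=3)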
- predict_missing_relations(head_entity: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple[source]
Given a head entity and a tail entity, return the top-k ranked relations.
argmax_{r in R} f(h,r,t), where h, t in E.
Parameters
head_entity: Union[List[str], str]
String representation of selected entities.
tail_entity: Union[List[str], str]
String representation of selected entities.
topk: int
Number of highest-ranked relations to return.
Returns: Tuple
Top-k scores and the corresponding relations
- predict_missing_tail_entity(head_entity: List[str] | str, relation: List[str] | str, within: List[str] = None, batch_size=2, topk=1, return_indices=False) torch.FloatTensor[source]
Given a head entity and a relation, return the top-k ranked tail entities.
argmax_{e in E} f(h,r,e), where h in E and r in R.
Parameters
head_entity: Union[List[str], str]
String representation of selected entities.
relation: Union[List[str], str]
String representation of selected relations.
Returns: torch.FloatTensor
Scores of the top-k ranked tail entities
- predict(*, h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, within=None, logits=True) torch.FloatTensor[source]
- Parameters:
logits
h
r
t
within
- predict_topk(*, h: str | List[str] = None, r: str | List[str] = None, t: str | List[str] = None, topk: int = 10, within: List[str] = None, batch_size: int = 1024)[source]
Predict missing item in a given triple.
- Returns:
If you query a single (h, r, ?) or (?, r, t) or (h, ?, t), returns List[(item, score)]
If you query a batch of B, returns List of B such lists.
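A sketch of the two return shapes, with placeholder labels and a hypothetical experiment path.
>>> from dicee import KGE
>>> model = KGE(path='Experiments/2024-01-01_12-00-00')  # hypothetical path
>>> # Single (h, r, ?) query: a list of (entity, score) pairs.
>>> model.predict_topk(h=['Paris'], r=['locatedIn'], topk=5)
>>> # Batch of two (?, r, t) queries: one such list per query.
>>> model.predict_topk(r=['locatedIn', 'locatedIn'], t=['France', 'Germany'], topk=5)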
- triple_score(h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, logits=False) torch.FloatTensor[source]
Predict the score of a triple.
Parameters
h: Union[List[str], str]
String representation of selected head entities.
r: Union[List[str], str]
String representation of selected relations.
t: Union[List[str], str]
String representation of selected tail entities.
logits: bool
If True, the unnormalized score is returned.
Returns: torch.FloatTensor
PyTorch tensor of triple scores
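A scoring sketch with placeholder labels; by default the score is normalized, while logits=True returns the raw score.
>>> from dicee import KGE
>>> model = KGE(path='Experiments/2024-01-01_12-00-00')  # hypothetical path
>>> model.triple_score(h=['Paris'], r=['locatedIn'], t=['France'])
>>> model.triple_score(h=['Paris'], r=['locatedIn'], t=['France'], logits=True)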
- answer_multi_hop_query(query_type: str = None, query: Tuple[str | Tuple[str, str], Ellipsis] = None, queries: List[Tuple[str | Tuple[str, str], Ellipsis]] = None, tnorm: str = 'prod', neg_norm: str = 'standard', lambda_: float = 0.0, k: int = 10, only_scores=False) List[Tuple[str, torch.Tensor]][source]
# @TODO: Refactoring is needed # @TODO: Score computation for each query type should be done in a static function
Find an answer set for EPFO queries including negation and disjunction
Parameters
query_type: str
The type of the query, e.g., "2p".
query: Union[str, Tuple[str, Tuple[str, str]]]
The query itself, either a string or a nested tuple.
queries: List[Tuple[Union[str, Tuple[str, str]], ...]]
A batch of queries.
tnorm: str
The t-norm operator.
neg_norm: str
The negation norm.
lambda_: float
The lambda parameter for the Sugeno and Yager negation norms.
k: int
The top-k substitutions for intermediate variables.
- Returns:
List[Tuple[str, torch.Tensor]]
Entities and corresponding scores, sorted in descending order of score
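A sketch for a 2p query; the nested-tuple encoding (anchor entity, (relation1, relation2)) is an assumption based on the signature above, and the labels are placeholders.
>>> from dicee import KGE
>>> model = KGE(path='Experiments/2024-01-01_12-00-00')  # hypothetical path
>>> # Entities reachable from the anchor via two hops, ranked by score.
>>> answers = model.answer_multi_hop_query(query_type='2p', query=('Paris', ('locatedIn', 'partOf')), tnorm='prod', k=10)
>>> answers[:3]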
- find_missing_triples(confidence: float, entities: List[str] = None, relations: List[str] = None, topk: int = 10, at_most: int = sys.maxsize) Set[source]
Find missing triples.
Iterate over a set of entities E and a set of relations R: for all e in E and for all r in R, score f(e,r,x) and return the triples (e,r,x) not in G with f(e,r,x) > confidence.
Parameters
confidence: float
A threshold on the sigmoid output of a triple score.
topk: int
Number of highest-ranked items used to select triples with f(e,r,x) > confidence.
at_most: int
Stop after finding at_most missing triples.
Returns
{ (e,r,x) | f(e,r,x) > confidence and (e,r,x) not in G }
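A sketch that restricts the sweep to a few placeholder entities and relations so the search stays small.
>>> from dicee import KGE
>>> model = KGE(path='Experiments/2024-01-01_12-00-00')  # hypothetical path
>>> # Candidate triples absent from G whose score exceeds the threshold.
>>> missing = model.find_missing_triples(confidence=0.95, entities=['Paris', 'Berlin'], relations=['locatedIn'], topk=10, at_most=100)
>>> len(missing)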
- predict_literals(entity: List[str] | str = None, attribute: List[str] | str = None, denormalize_preds: bool = True) numpy.ndarray[source]
Predicts literal values for given entities and attributes.
- Parameters:
entity (Union[List[str], str]) – Entity or list of entities to predict literals for.
attribute (Union[List[str], str]) – Attribute or list of attributes to predict literals for.
denormalize_preds (bool) – If True, denormalizes the predictions.
- Returns:
Predictions for the given entities and attributes.
- Return type:
numpy ndarray
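A sketch assuming the model was trained with literal (attribute) data; the entity and attribute names are placeholders.
>>> from dicee import KGE
>>> model = KGE(path='Experiments/2024-01-01_12-00-00')  # hypothetical path
>>> # Denormalized literal predictions, returned as a numpy.ndarray.
>>> model.predict_literals(entity=['Paris'], attribute=['population'], denormalize_preds=True)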
- class dicee.QueryGenerator(train_path, val_path: str, test_path: str, ent2id: Dict = None, rel2id: Dict = None, seed: int = 1, gen_valid: bool = False, gen_test: bool = True)[source]
- train_path
- val_path
- test_path
- gen_valid = False
- gen_test = True
- seed = 1
- max_ans_num = 1000000.0
- mode
- ent2id = None
- rel2id: Dict = None
- ent_in: Dict
- ent_out: Dict
- query_name_to_struct
- construct_graph(paths: List[str]) Tuple[Dict, Dict][source]
Construct a graph from triples. Returns dicts with incoming and outgoing edges.
- fill_query(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, answer: int) bool[source]
Private method for fill_query logic.
- achieve_answer(query: List[str | List], ent_in: Dict, ent_out: Dict) set[source]
Private method for achieve_answer logic. @TODO: Document the code
- ground_queries(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, small_ent_in: Dict, small_ent_out: Dict, gen_num: int, query_name: str)[source]
Generate queries and obtain their answers.
- generate_queries(query_struct: List, gen_num: int, query_type: str)[source]
Pass incoming and outgoing edges to ground_queries depending on the mode (train, valid, or test) and return the generated queries and answers. @TODO: create a class for each single query struct
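A generation sketch; the triple file paths are placeholders, and the ['e', ['r', 'r']] structure for a 2p query is an assumption about the encoding used in query_name_to_struct.
>>> from dicee import QueryGenerator
>>> qg = QueryGenerator(train_path='KGs/UMLS/train.txt', val_path='KGs/UMLS/valid.txt', test_path='KGs/UMLS/test.txt', gen_test=True)
>>> # Generate 100 two-hop ('2p') queries together with their answer sets.
>>> qg.generate_queries(query_struct=['e', ['r', 'r']], gen_num=100, query_type='2p')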
- class dicee.DICE_Trainer(args, is_continual_training: bool, storage_path, evaluator=None)[source]
- DICE_Trainer implements three training backends:
1. PyTorch Lightning trainer (https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html)
2. Multi-GPU trainer (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
3. CPU trainer
args
is_continual_training: bool
storage_path: str
evaluator
report: dict
- report
- args
- trainer = None
- is_continual_training
- storage_path
- evaluator = None
- form_of_labelling = None
- continual_start(knowledge_graph)[source]
Initialize continual training.
1. Load the model.
2. Load the trainer.
3. Fit the model.
Parameters
knowledge_graph
- Returns:
model
form_of_labelling (str)
- initialize_trainer(callbacks: List) lightning.Trainer | dicee.trainer.model_parallelism.TensorParallel | dicee.trainer.torch_trainer.TorchTrainer | dicee.trainer.torch_trainer_ddp.TorchDDPTrainer[source]
Initialize Trainer from input arguments
- start(knowledge_graph: dicee.knowledge_graph.KG | numpy.memmap) Tuple[dicee.models.base_model.BaseKGE, str][source]
Start the training.
1. Initialize the trainer.
2. Initialize or load a pretrained KGE model.
In a DDP setup, the memory map of the already read and indexed KG must be loaded.
- k_fold_cross_validation(dataset) Tuple[dicee.models.base_model.BaseKGE, str][source]
Perform K-fold cross-validation.
1. Obtain K train and test splits.
2. For each split:
2.1. Initialize the trainer and the model.
2.2. Train the model with the configuration provided in args.
2.3. Compute the mean reciprocal rank (MRR) of the model on the respective test split.
3. Report the mean MRR over the splits.
- Parameters:
self
dataset
- Returns:
model
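A sketch of driving DICE_Trainer directly; normally Execute wires this up, and both the argument object and the KG constructor call below are assumptions.
>>> from dicee import DICE_Trainer
>>> from dicee.knowledge_graph import KG
>>> from dicee.config import Namespace
>>> args = Namespace()
>>> args.model = 'Keci'                             # hypothetical configuration
>>> kg = KG(dataset_dir='KGs/UMLS')                 # assumed KG constructor usage
>>> trainer = DICE_Trainer(args, is_continual_training=False, storage_path='Experiments/demo')
>>> model, form_of_labelling = trainer.start(knowledge_graph=kg)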
- class dicee.Evaluator(args, is_continual_training: bool = False)[source]
Evaluator class for KGE models in various downstream tasks.
Orchestrates link prediction evaluation with different scoring techniques including standard evaluation and byte-pair encoding based evaluation.
- er_vocab
Entity-relation to tail vocabulary for filtered ranking.
- re_vocab
Relation-entity (tail) to head vocabulary.
- ee_vocab
Entity-entity to relation vocabulary.
- num_entities
Total number of entities in the knowledge graph.
- num_relations
Total number of relations in the knowledge graph.
- args
Configuration arguments.
- report
Dictionary storing evaluation results.
- during_training
Whether evaluation is happening during training.
Example
>>> from dicee.evaluation import Evaluator
>>> evaluator = Evaluator(args)
>>> results = evaluator.eval(dataset, model, 'EntityPrediction')
>>> print(f"Test MRR: {results['Test']['MRR']:.4f}")
- re_vocab: Dict | None = None
- er_vocab: Dict | None = None
- ee_vocab: Dict | None = None
- func_triple_to_bpe_representation = None
- is_continual_training = False
- num_entities: int | None = None
- num_relations: int | None = None
- domain_constraints_per_rel = None
- range_constraints_per_rel = None
- args
- report: Dict
- during_training = False
- vocab_preparation(dataset) None[source]
Prepare vocabularies from the dataset for evaluation.
Resolves any future objects and saves vocabularies to disk.
- Parameters:
dataset – Knowledge graph dataset with vocabulary attributes.
- eval(dataset, trained_model, form_of_labelling: str, during_training: bool = False) Dict | None[source]
Evaluate the trained model on the dataset.
- Parameters:
dataset – Knowledge graph dataset (KG instance).
trained_model – The trained KGE model.
form_of_labelling – Type of labelling (‘EntityPrediction’ or ‘RelationPrediction’).
during_training – Whether evaluation is during training.
- Returns:
Dictionary of evaluation metrics, or None if evaluation is skipped.
- eval_rank_of_head_and_tail_entity(*, train_set, valid_set=None, test_set=None, trained_model) None[source]
Evaluate with negative sampling scoring.
- eval_rank_of_head_and_tail_byte_pair_encoded_entity(*, train_set=None, valid_set=None, test_set=None, ordered_bpe_entities, trained_model) None[source]
Evaluate with BPE-encoded entities and negative sampling.
- eval_with_byte(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None[source]
Evaluate BytE model with generation.
- eval_with_bpe_vs_all(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None[source]
Evaluate with BPE and KvsAll scoring.
- eval_with_vs_all(*, train_set, valid_set=None, test_set=None, trained_model, form_of_labelling) None[source]
Evaluate with KvsAll or 1vsAll scoring.
- evaluate_lp_k_vs_all(model, triple_idx, info: str | None = None, form_of_labelling: str | None = None) Dict[str, float][source]
Filtered link prediction evaluation with KvsAll scoring.
- Parameters:
model – The trained model to evaluate.
triple_idx – Integer-indexed test triples.
info – Description to print.
form_of_labelling – ‘EntityPrediction’ or ‘RelationPrediction’.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- evaluate_lp_with_byte(model, triples: List[List[str]], info: str | None = None) Dict[str, float][source]
Evaluate BytE model with text generation.
- Parameters:
model – BytE model.
triples – String triples.
info – Description to print.
- Returns:
Dictionary with placeholder metrics (-1 values).
- evaluate_lp_bpe_k_vs_all(model, triples: List[List[str]], info: str | None = None, form_of_labelling: str | None = None) Dict[str, float][source]
Evaluate BPE model with KvsAll scoring.
- Parameters:
model – BPE-enabled model.
triples – String triples.
info – Description to print.
form_of_labelling – Type of labelling.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- evaluate_lp(model, triple_idx, info: str) Dict[str, float][source]
Evaluate link prediction with negative sampling.
- Parameters:
model – The model to evaluate.
triple_idx – Integer-indexed triples.
info – Description to print.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- dummy_eval(trained_model, form_of_labelling: str) None[source]
Run evaluation from saved data (for continual training).
- Parameters:
trained_model – The trained model.
form_of_labelling – Type of labelling.
- eval_with_data(dataset, trained_model, triple_idx: numpy.ndarray, form_of_labelling: str) Dict[str, float][source]
Evaluate a trained model on a given dataset.
- Parameters:
dataset – Knowledge graph dataset.
trained_model – The trained model.
triple_idx – Integer-indexed triples to evaluate.
form_of_labelling – Type of labelling.
- Returns:
Dictionary with evaluation metrics.
- Raises:
ValueError – If scoring technique is invalid.
- dicee.__version__ = '0.3.2'