dicee
DICE Embeddings - Knowledge Graph Embedding Library.
A library for training and using knowledge graph embedding models with support for various scoring techniques and training strategies.
- Submodules:
  - evaluation: Model evaluation functions and Evaluator class
  - models: KGE model implementations
  - trainer: Training orchestration
  - scripts: Utility scripts
Submodules
- dicee.__main__
- dicee.abstracts
- dicee.analyse_experiments
- dicee.callbacks
- dicee.config
- dicee.dataset_classes
- dicee.eval_static_funcs
- dicee.evaluation
- dicee.evaluator
- dicee.executer
- dicee.knowledge_graph
- dicee.knowledge_graph_embeddings
- dicee.models
- dicee.query_generator
- dicee.read_preprocess_save_load_kg
- dicee.sanity_checkers
- dicee.scripts
- dicee.static_funcs
- dicee.static_funcs_training
- dicee.static_preprocess_funcs
- dicee.trainer
- dicee.weight_averaging
Attributes
- dicee.__version__
Classes
- Execute: Executor class for training, retraining and evaluating KGE models.
- KGE: Knowledge Graph Embedding Class for interactive usage of pre-trained models.
- DICE_Trainer: Trainer implementing PyTorch Lightning, multi-GPU (DDP), and CPU training.
- Evaluator: Evaluator class for KGE models in various downstream tasks.
Package Contents
- class dicee.Execute(args, continuous_training: bool = False)
Executor class for training, retraining and evaluating KGE models.
Handles the complete workflow (a usage sketch follows at the end of this class entry):
1. Loading & preprocessing & serializing input data
2. Training & validation & testing
3. Storing all necessary information
- args
Processed input arguments.
- distributed
Whether distributed training is enabled.
- rank
Process rank in distributed training.
- world_size
Total number of processes.
- local_rank
Local GPU rank.
- trainer
Training handler instance.
- trained_model
The trained model after training completes.
- knowledge_graph
The loaded knowledge graph.
- report
Dictionary storing training metrics and results.
- evaluator
Model evaluation handler.
- distributed
- args
- is_continual_training = False
- trainer: dicee.trainer.DICE_Trainer | None = None
- trained_model = None
- knowledge_graph: dicee.knowledge_graph.KG | None = None
- report: Dict
- evaluator: dicee.evaluator.Evaluator | None = None
- start_time: float | None = None
- is_rank_zero() bool
- cleanup()
- setup_executor() None
Set up storage directories for the experiment.
Creates or reuses experiment directories based on configuration. Saves the configuration to a JSON file.
- create_and_store_kg() None
Create knowledge graph and store as memory-mapped file.
Only executed on rank 0 in distributed training. Skips if memmap already exists.
- load_from_memmap() None
Load knowledge graph from memory-mapped file.
- save_trained_model() None
Save a knowledge graph embedding model.
1. Set the model to eval mode and move it to CPU.
2. Store the memory footprint of the model.
3. Save the model to disk.
4. Update the statistics of the KG.
- rtype:
None
- end(form_of_labelling: str) dict
End training.
1. Store the trained model.
2. Report runtimes.
3. Evaluate the model if required.
- rtype:
A dict containing information about the training and/or evaluation
- write_report() None
Report training-related information in a report.json file.
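Example (an illustrative sketch, not taken from the library docs: the configuration field names and the start() entry point follow typical dicee usage and are assumptions to check against your installed version):
>>> from dicee.executer import Execute
>>> from dicee.config import Namespace
>>> args = Namespace()
>>> args.model = 'Keci'              # assumed: any supported KGE model name
>>> args.dataset_dir = 'KGs/UMLS'    # assumed: directory with train/valid/test triples
>>> args.num_epochs = 10
>>> args.embedding_dim = 32
>>> reports = Execute(args).start()  # assumed entry point: loads, trains, evaluates, stores
>>> print(reports)                   # dictionary of runtimes and metrics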
- class dicee.KGE(path=None, url=None, construct_ensemble=False, model_name=None)
Bases: dicee.abstracts.BaseInteractiveKGE, dicee.abstracts.InteractiveQueryDecomposition, dicee.abstracts.BaseInteractiveTrainKGE
Knowledge Graph Embedding Class for interactive usage of pre-trained models.
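A minimal interactive-use sketch; the experiment path below is hypothetical and should point to a directory produced by a previous training run:
>>> from dicee import KGE
>>> model = KGE(path='Experiments/Keci_UMLS')   # hypothetical path to a stored run
>>> print(model)                                # __str__ summarises the wrapped model
>>> model.to('cpu')                             # move parameters to the chosen device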
- __str__()
- to(device: str) None
- get_transductive_entity_embeddings(indices: torch.LongTensor | List[str], as_pytorch=False, as_numpy=False, as_list=True) torch.FloatTensor | numpy.ndarray | List[float]
- create_vector_database(collection_name: str, distance: str, location: str = 'localhost', port: int = 6333)
- generate(h='', r='')
- eval_lp_performance(dataset=List[Tuple[str, str, str]], filtered=True)
- predict_missing_head_entity(relation: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple
Given a relation and a tail entity, return top k ranked head entity.
argmax_{e in E } f(e,r,t), where r in R, t in E.
Parameters
relation: Union[List[str], str]
String representation of selected relations.
tail_entity: Union[List[str], str]
String representation of selected entities.
topk: int
Number of top-ranked head entities to return.
Returns: Tuple
Highest K scores and entities
- predict_missing_relations(head_entity: List[str] | str, tail_entity: List[str] | str, within=None, batch_size=2, topk=1, return_indices=False) Tuple
Given a head entity and a tail entity, return top k ranked relations.
argmax_{r in R } f(h,r,t), where h, t in E.
Parameters
head_entity: List[str]
String representation of selected entities.
tail_entity: List[str]
String representation of selected entities.
topk: int
Number of top-ranked relations to return.
Returns: Tuple
Highest k scores and relations
- predict_missing_tail_entity(head_entity: List[str] | str, relation: List[str] | str, within: List[str] = None, batch_size=2, topk=1, return_indices=False) torch.FloatTensor
Given a head entity and a relation, return top k ranked entities
argmax_{e in E } f(h,r,e), where h in E and r in R.
Parameters
head_entity: List[str]
String representation of selected entities.
relation: List[str]
String representation of selected relations.
Returns: Tuple
scores
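A usage sketch for the three completion methods above, assuming `model` is a loaded KGE instance; the entity and relation labels are hypothetical and must exist in the model's vocabulary:
>>> # rank head entities for (?, r, t)
>>> scores, heads = model.predict_missing_head_entity(relation=['example_relation'], tail_entity=['example_entity'], topk=5)
>>> # rank relations for (h, ?, t)
>>> scores, relations = model.predict_missing_relations(head_entity=['example_entity'], tail_entity=['example_entity'], topk=5)
>>> # rank tail entities for (h, r, ?)
>>> scores = model.predict_missing_tail_entity(head_entity=['example_entity'], relation=['example_relation'], topk=5)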
- predict(*, h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, within=None, logits=True) torch.FloatTensor
- Parameters:
logits
h
r
t
within
- predict_topk(*, h: str | List[str] = None, r: str | List[str] = None, t: str | List[str] = None, topk: int = 10, within: List[str] = None, batch_size: int = 1024)
Predict missing item in a given triple.
- Returns:
If you query a single (h, r, ?) or (?, r, t) or (h, ?, t), returns List[(item, score)]
If you query a batch of B, returns List of B such lists.
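A sketch for a single (h, r, ?) query; per the return description above, a single query yields a list of (item, score) pairs (the labels are hypothetical):
>>> results = model.predict_topk(h='example_entity', r='example_relation', topk=10)
>>> for item, score in results:
...     print(item, score)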
- triple_score(h: List[str] | str = None, r: List[str] | str = None, t: List[str] | str = None, logits=False) torch.FloatTensor
Predict triple score
Parameters
head_entity: List[str]
String representation of selected entities.
relation: List[str]
String representation of selected relations.
tail_entity: List[str]
String representation of selected entities.
logits: bool
If True, the unnormalized score is returned.
Returns: torch.FloatTensor
PyTorch tensor of triple scores
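A scoring sketch with hypothetical labels; logits=True returns the raw (unnormalized) score, the default returns the score as described above:
>>> score = model.triple_score(h=['example_head'], r=['example_relation'], t=['example_tail'])
>>> raw_score = model.triple_score(h=['example_head'], r=['example_relation'], t=['example_tail'], logits=True)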
- return_multi_hop_query_results(aggregated_query_for_all_entities, k: int, only_scores)
- single_hop_query_answering(query: tuple, only_scores: bool = True, k: int = None)
- answer_multi_hop_query(query_type: str = None, query: Tuple[str | Tuple[str, str], Ellipsis] = None, queries: List[Tuple[str | Tuple[str, str], Ellipsis]] = None, tnorm: str = 'prod', neg_norm: str = 'standard', lambda_: float = 0.0, k: int = 10, only_scores=False) List[Tuple[str, torch.Tensor]]
# @TODO: Refactoring is needed # @TODO: Score computation for each query type should be done in a static function
Find an answer set for EPFO queries including negation and disjunction
Parameters
query_type: str
The type of the query, e.g., “2p”.
query: Union[str, Tuple[str, Tuple[str, str]]]
The query itself, either a string or a nested tuple.
queries: List of Tuple[Union[str, Tuple[str, str]], …]
tnorm: str
The t-norm operator.
neg_norm: str
The negation norm.
lambda_: float
Lambda parameter for the sugeno and yager negation norms.
k: int
The top-k substitutions for intermediate variables.
- returns:
List[Tuple[str, torch.Tensor]]
Entities and corresponding scores, sorted in descending order of score
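A sketch answering a 2p (two-hop projection) query; the anchor entity, relations, and the ("anchor", ("r1", "r2")) query encoding are illustrative assumptions, see dicee.query_generator for the exact structures:
>>> answers = model.answer_multi_hop_query(query_type='2p',
...                                        query=('example_entity', ('example_relation_1', 'example_relation_2')),
...                                        tnorm='prod', k=10)
>>> for entity, score in answers:
...     print(entity, float(score))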
- find_missing_triples(confidence: float, entities: List[str] = None, relations: List[str] = None, topk: int = 10, at_most: int = sys.maxsize) Set
Find missing triples.
Iterate over a set of entities E and a set of relations R: for all e in E and for all r in R, compute f(e,r,x) and return the triples (e,r,x) not in G with f(e,r,x) > confidence.
confidence: float
A threshold on the output of the sigmoid function for a given triple.
topk: int
Highest ranked k items used to select triples with f(e,r,x) > confidence.
at_most: int
Stop after finding at_most missing triples.
Returns: {(e,r,x) | f(e,r,x) > confidence and (e,r,x) not in G}
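A mining sketch with hypothetical labels; restricting entities and relations keeps the search tractable:
>>> missing = model.find_missing_triples(confidence=0.95,
...                                      entities=['example_entity'],
...                                      relations=['example_relation'],
...                                      topk=10, at_most=100)
>>> len(missing)   # set of (head, relation, tail) triples scored above the threshold and absent from G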
- predict_literals(entity: List[str] | str = None, attribute: List[str] | str = None, denormalize_preds: bool = True) numpy.ndarray
Predicts literal values for given entities and attributes.
- Parameters:
entity (Union[List[str], str]) – Entity or list of entities to predict literals for.
attribute (Union[List[str], str]) – Attribute or list of attributes to predict literals for.
denormalize_preds (bool) – If True, denormalizes the predictions.
- Returns:
Predictions for the given entities and attributes.
- Return type:
numpy.ndarray
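A sketch assuming the model was trained with literal data; the entity and attribute names are hypothetical:
>>> values = model.predict_literals(entity=['example_entity'],
...                                 attribute=['example_numeric_attribute'],
...                                 denormalize_preds=True)
>>> values.shape   # numpy.ndarray of predicted literal values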
- class dicee.QueryGenerator(train_path, val_path: str, test_path: str, ent2id: Dict = None, rel2id: Dict = None, seed: int = 1, gen_valid: bool = False, gen_test: bool = True)
- train_path
- val_path
- test_path
- gen_valid = False
- gen_test = True
- seed = 1
- max_ans_num = 1000000.0
- mode
- ent2id = None
- rel2id: Dict = None
- ent_in: Dict
- ent_out: Dict
- query_name_to_struct
- list2tuple(list_data)
- tuple2list(x: List | Tuple) List | Tuple
Convert a nested tuple to a nested list.
- set_global_seed(seed: int)
Set seed
- construct_graph(paths: List[str]) Tuple[Dict, Dict]
Construct a graph from triples. Returns dicts with incoming and outgoing edges.
- fill_query(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, answer: int) bool
Private method for fill_query logic.
- achieve_answer(query: List[str | List], ent_in: Dict, ent_out: Dict) set
Private method for achieve_answer logic. @TODO: Document the code
- write_links(ent_out, small_ent_out)
- ground_queries(query_structure: List[str | List], ent_in: Dict, ent_out: Dict, small_ent_in: Dict, small_ent_out: Dict, gen_num: int, query_name: str)
Generate queries and obtain their answers.
- unmap(query_type, queries, tp_answers, fp_answers, fn_answers)
- unmap_query(query_structure, query, id2ent, id2rel)
- generate_queries(query_struct: List, gen_num: int, query_type: str)
Pass incoming and outgoing edges to ground queries, depending on the mode [train, valid, or test], and return queries and answers. @TODO: create a class for each single query struct
- save_queries(query_type: str, gen_num: int, save_path: str)
- abstract load_queries(path)
- get_queries(query_type: str, gen_num: int)
- static save_queries_and_answers(path: str, data: List[Tuple[str, Tuple[collections.defaultdict]]]) None
Save queries to disk.
- static load_queries_and_answers(path: str) List[Tuple[str, Tuple[collections.defaultdict]]]
Load queries from disk into memory.
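A generation sketch; the triple file paths are placeholders and the structure of the returned object is not documented in this section:
>>> from dicee.query_generator import QueryGenerator
>>> qg = QueryGenerator(train_path='KGs/UMLS/train.txt',
...                     val_path='KGs/UMLS/valid.txt',
...                     test_path='KGs/UMLS/test.txt',
...                     seed=1)
>>> result = qg.get_queries(query_type='2p', gen_num=100)   # returned structure not documented here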
- class dicee.DICE_Trainer(args, is_continual_training: bool, storage_path, evaluator=None)
- DICE_Trainer implements three training backends (a direct-use sketch follows at the end of this class entry):
1. PyTorch Lightning trainer (https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html)
2. Multi-GPU trainer via DistributedDataParallel (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
3. CPU trainer
args
is_continual_training: bool
storage_path: str
evaluator
report: dict
- report
- args
- trainer = None
- is_continual_training
- storage_path
- evaluator = None
- form_of_labelling = None
- continual_start(knowledge_graph)
1. Initialize training.
2. Load the model.
3. Load the trainer.
4. Fit the model.
- returns:
model
form_of_labelling (str)
- initialize_trainer(callbacks: List) lightning.Trainer | dicee.trainer.model_parallelism.TensorParallel | dicee.trainer.torch_trainer.TorchTrainer | dicee.trainer.torch_trainer_ddp.TorchDDPTrainer
Initialize Trainer from input arguments
- initialize_or_load_model()
- init_dataloader(dataset: torch.utils.data.Dataset) torch.utils.data.DataLoader
- init_dataset() torch.utils.data.Dataset
- start(knowledge_graph: dicee.knowledge_graph.KG | numpy.memmap) Tuple[dicee.models.base_model.BaseKGE, str]
Start the training.
1. Initialize the trainer.
2. Initialize or load a pretrained KGE model.
3. In a DDP setup, load the memory map of the already read and indexed KG.
- k_fold_cross_validation(dataset) Tuple[dicee.models.base_model.BaseKGE, str]
Perform K-fold cross-validation.
1. Obtain K train and test splits.
2. For each split:
2.1. Initialize trainer and model.
2.2. Train the model with the configuration provided in args.
2.3. Compute the mean reciprocal rank (MRR) of the model on the respective test split.
3. Report the mean MRR across splits.
- Parameters:
self
dataset
- Returns:
model
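DICE_Trainer is normally driven by Execute, but a direct-use sketch looks roughly as follows; the KG constructor argument and the configuration fields are assumptions not documented in this section:
>>> from dicee.trainer import DICE_Trainer
>>> from dicee.knowledge_graph import KG
>>> from dicee.config import Namespace
>>> args = Namespace()
>>> args.model = 'Keci'
>>> args.dataset_dir = 'KGs/UMLS'                 # assumed configuration field
>>> kg = KG(dataset_dir=args.dataset_dir)         # assumed constructor argument
>>> trainer = DICE_Trainer(args=args, is_continual_training=False, storage_path='Experiments/demo')
>>> trained_model, form_of_labelling = trainer.start(knowledge_graph=kg)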
- class dicee.Evaluator(args, is_continual_training: bool = False)
Evaluator class for KGE models in various downstream tasks.
Orchestrates link prediction evaluation with different scoring techniques including standard evaluation and byte-pair encoding based evaluation.
- er_vocab
Entity-relation to tail vocabulary for filtered ranking.
- re_vocab
Relation-entity (tail) to head vocabulary.
- ee_vocab
Entity-entity to relation vocabulary.
- num_entities
Total number of entities in the knowledge graph.
- num_relations
Total number of relations in the knowledge graph.
- args
Configuration arguments.
- report
Dictionary storing evaluation results.
- during_training
Whether evaluation is happening during training.
Example
>>> from dicee.evaluation import Evaluator
>>> evaluator = Evaluator(args)
>>> results = evaluator.eval(dataset, model, 'EntityPrediction')
>>> print(f"Test MRR: {results['Test']['MRR']:.4f}")
- re_vocab: Dict | None = None
- er_vocab: Dict | None = None
- ee_vocab: Dict | None = None
- func_triple_to_bpe_representation = None
- is_continual_training = False
- num_entities: int | None = None
- num_relations: int | None = None
- domain_constraints_per_rel = None
- range_constraints_per_rel = None
- args
- report: Dict
- during_training = False
- vocab_preparation(dataset) None
Prepare vocabularies from the dataset for evaluation.
Resolves any future objects and saves vocabularies to disk.
- Parameters:
dataset – Knowledge graph dataset with vocabulary attributes.
- eval(dataset, trained_model, form_of_labelling: str, during_training: bool = False) Dict | None
Evaluate the trained model on the dataset.
- Parameters:
dataset – Knowledge graph dataset (KG instance).
trained_model – The trained KGE model.
form_of_labelling – Type of labelling (‘EntityPrediction’ or ‘RelationPrediction’).
during_training – Whether evaluation is during training.
- Returns:
Dictionary of evaluation metrics, or None if evaluation is skipped.
- eval_rank_of_head_and_tail_entity(*, train_set, valid_set=None, test_set=None, trained_model) None
Evaluate with negative sampling scoring.
- eval_rank_of_head_and_tail_byte_pair_encoded_entity(*, train_set=None, valid_set=None, test_set=None, ordered_bpe_entities, trained_model) None
Evaluate with BPE-encoded entities and negative sampling.
- eval_with_byte(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None
Evaluate BytE model with generation.
- eval_with_bpe_vs_all(*, raw_train_set, raw_valid_set=None, raw_test_set=None, trained_model, form_of_labelling) None
Evaluate with BPE and KvsAll scoring.
- eval_with_vs_all(*, train_set, valid_set=None, test_set=None, trained_model, form_of_labelling) None
Evaluate with KvsAll or 1vsAll scoring.
- evaluate_lp_k_vs_all(model, triple_idx, info: str = None, form_of_labelling: str = None) Dict[str, float]
Filtered link prediction evaluation with KvsAll scoring.
- Parameters:
model – The trained model to evaluate.
triple_idx – Integer-indexed test triples.
info – Description to print.
form_of_labelling – ‘EntityPrediction’ or ‘RelationPrediction’.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- evaluate_lp_with_byte(model, triples: List[List[str]], info: str = None) Dict[str, float]
Evaluate BytE model with text generation.
- Parameters:
model – BytE model.
triples – String triples.
info – Description to print.
- Returns:
Dictionary with placeholder metrics (-1 values).
- evaluate_lp_bpe_k_vs_all(model, triples: List[List[str]], info: str = None, form_of_labelling: str = None) Dict[str, float]
Evaluate BPE model with KvsAll scoring.
- Parameters:
model – BPE-enabled model.
triples – String triples.
info – Description to print.
form_of_labelling – Type of labelling.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- evaluate_lp(model, triple_idx, info: str) Dict[str, float]
Evaluate link prediction with negative sampling.
- Parameters:
model – The model to evaluate.
triple_idx – Integer-indexed triples.
info – Description to print.
- Returns:
Dictionary with H@1, H@3, H@10, and MRR metrics.
- dummy_eval(trained_model, form_of_labelling: str) None
Run evaluation from saved data (for continual training).
- Parameters:
trained_model – The trained model.
form_of_labelling – Type of labelling.
- eval_with_data(dataset, trained_model, triple_idx: numpy.ndarray, form_of_labelling: str) Dict[str, float]
Evaluate a trained model on a given dataset.
- Parameters:
dataset – Knowledge graph dataset.
trained_model – The trained model.
triple_idx – Integer-indexed triples to evaluate.
form_of_labelling – Type of labelling.
- Returns:
Dictionary with evaluation metrics.
- Raises:
ValueError – If scoring technique is invalid.
- dicee.__version__ = '0.1.5'