dicee.dataset_classes
Dataset classes for knowledge graph embedding training.
This package groups the various torch.utils.data.Dataset implementations
used throughout DICE Embeddings into thematic sub-modules:
- _bpe – Byte-pair-encoding related datasets.
- _negative_sampling – Negative-sampling-based datasets.
- _label_based – Multi-label / multi-class scoring datasets.
- _literal – Literal (numeric) embedding dataset.
- _factory – construct_dataset / reload_dataset helpers.
All public names are re-exported here so that existing from
dicee.dataset_classes import … statements continue to work unchanged.
Classes

- BPE_NegativeSamplingDataset – Dataset for negative sampling with byte-pair encoded triples.
- MultiClassClassificationDataset – Dataset for autoregressive multi-class classification on sub-word units.
- MultiLabelDataset – Multi-label dataset for BPE-based KvsAll / AllvsAll training.
- AllvsAll – Dataset for AllvsAll training (multi-label, exhaustive).
- KvsAll – Dataset for KvsAll training (multi-label).
- KvsSampleDataset – Dataset for KvsSample training (dynamic multi-label).
- OnevsAllDataset – Dataset for the 1-vs-All training strategy (multi-class).
- FixedNegSampleDataset – Pre-computed (fixed) negative sampling dataset.
- OnevsSample – Dataset for 1-vs-Sample training (dynamic multi-class with negatives).
- TriplePredictionDataset – Dataset for triple prediction with on-the-fly negative sampling.
- LiteralDataset – Dataset for loading and processing literal data for Literal Embedding models.

Functions

- construct_dataset – Build the appropriate dataset for the given training configuration.
- reload_dataset – Reload training data from disk and construct a PyTorch dataset.
Package Contents
- class dicee.dataset_classes.BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)[source]
Bases: torch.utils.data.Dataset

Dataset for negative sampling with byte-pair encoded triples.

Each sample is a BPE-encoded triple. The custom collate_fn constructs negatives by corrupting head or tail entities with random BPE entities.

- Parameters:
train_set (torch.LongTensor) – Integer-encoded triples of shape (N, 3).
ordered_shaped_bpe_entities (torch.LongTensor) – All BPE entity representations, ordered by entity index.
neg_ratio (int) – Number of negative samples per positive triple.
- train_set
- ordered_bpe_entities
- num_bpe_entities
- neg_ratio
- num_datapoints
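The head/tail corruption performed by the collate_fn can be sketched in NumPy (an illustration only, not the dicee implementation; corrupt_batch is a hypothetical helper name):

```python
import numpy as np

def corrupt_batch(positives: np.ndarray, num_entities: int, neg_ratio: int,
                  rng: np.random.Generator) -> np.ndarray:
    # Repeat each positive neg_ratio times, then overwrite either the
    # head (column 0) or the tail (column 2) with a random entity index.
    negatives = np.repeat(positives, neg_ratio, axis=0)
    random_entities = rng.integers(0, num_entities, size=len(negatives))
    corrupt_head = rng.random(len(negatives)) < 0.5
    negatives[corrupt_head, 0] = random_entities[corrupt_head]
    negatives[~corrupt_head, 2] = random_entities[~corrupt_head]
    return negatives

rng = np.random.default_rng(0)
pos = np.array([[0, 0, 1], [2, 1, 3]])
neg = corrupt_batch(pos, num_entities=10, neg_ratio=2, rng=rng)
print(neg.shape)  # (4, 3)
```

Note that the relation column is never corrupted, so each negative shares its relation with the positive it was derived from.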
- class dicee.dataset_classes.MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)[source]
Bases: torch.utils.data.Dataset

Dataset for autoregressive multi-class classification on sub-word units.

Splits a flat sequence of sub-word token ids into overlapping windows of size block_size for next-token prediction.

- Parameters:
subword_units (numpy.ndarray) – 1-D array of sub-word token ids.
block_size (int, optional) – Context window length (default 8).
- train_data
- block_size = 8
- num_of_data_points
- collate_fn = None
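The windowing scheme can be illustrated as follows (a sketch under the stated assumptions; windows is a hypothetical helper, and the target window is the context shifted right by one, as is standard for next-token prediction):

```python
import numpy as np

def windows(tokens: np.ndarray, block_size: int):
    # Slide a window of block_size over the flat token sequence; the
    # target is the same window shifted right by one position.
    for i in range(len(tokens) - block_size):
        yield tokens[i:i + block_size], tokens[i + 1:i + block_size + 1]

tokens = np.arange(12)
pairs = list(windows(tokens, block_size=8))
print(len(pairs))  # 4
```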
- class dicee.dataset_classes.MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)[source]
Bases: torch.utils.data.Dataset

Multi-label dataset for BPE-based KvsAll / AllvsAll training.

Each sample is a BPE-encoded (head, relation) pair together with a binary multi-label target vector over all entities.

- Parameters:
train_set (torch.LongTensor) – BPE-encoded input pairs of shape (N, 2, token_length).
train_indices_target (torch.LongTensor) – Per-sample lists of positive target entity indices.
target_dim (int) – Dimensionality of the target vector (number of entities).
torch_ordered_shaped_bpe_entities (torch.LongTensor) – Ordered BPE entity representations.
- train_set
- train_indices_target
- target_dim
- num_datapoints
- torch_ordered_shaped_bpe_entities
- collate_fn = None
- class dicee.dataset_classes.AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)[source]
Bases: torch.utils.data.Dataset

Dataset for AllvsAll training (multi-label, exhaustive).

Extends the KvsAll idea: every possible (entity, relation) combination is included, not just those observed in the KG. Pairs without any known tail entities receive an all-zeros label vector.

- Parameters:
train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.
entity_idxs (dict) – Entity-name → index mapping.
relation_idxs (dict) – Relation-name → index mapping.
label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).
- train_data = None
- train_target = None
- label_smoothing_rate
- collate_fn = None
- target_dim
- class dicee.dataset_classes.KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)[source]
Bases: torch.utils.data.Dataset

Dataset for KvsAll training (multi-label).
- D := {(x, y)_i}_{i=1}^{N} where
x = (h, r) is a unique (entity, relation) pair observed in the KG,
y ∈ [0, 1]^{|E|} is a multi-label vector with y_j = 1 iff (h, r, e_j) ∈ KG.
- Parameters:
train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.
entity_idxs (dict) – Entity-name → index mapping.
relation_idxs (dict) – Relation-name → index mapping.
form (str) – 'EntityPrediction' or 'RelationPrediction'.
label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).
- train_data = None
- train_target = None
- label_smoothing_rate
- collate_fn = None
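The construction of D above can be sketched in plain NumPy (illustrative only; kvsall_targets is a hypothetical helper, and label smoothing would be applied to the resulting binary vectors afterwards):

```python
import numpy as np
from collections import defaultdict

def kvsall_targets(triples, num_entities):
    # Group tails by their (head, relation) pair, then build one binary
    # label vector over all entities for each unique pair.
    tails = defaultdict(list)
    for h, r, t in triples:
        tails[(h, r)].append(t)
    pairs = sorted(tails)
    targets = np.zeros((len(pairs), num_entities))
    for i, pair in enumerate(pairs):
        targets[i, tails[pair]] = 1.0
    return np.array(pairs), targets

triples = [(0, 0, 1), (0, 0, 2), (1, 0, 2)]
pairs, y = kvsall_targets(triples, num_entities=4)
print(y.tolist())  # [[0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
```

Note that the two triples sharing the pair (0, 0) collapse into a single training example with two active labels.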
- class dicee.dataset_classes.KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)[source]
Bases: torch.utils.data.Dataset

Dataset for KvsSample training (dynamic multi-label).

Like KvsAll but sub-samples the target vector at each access to keep mini-batch sizes manageable when the entity set is large.

- Parameters:
train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.
entity_idxs (dict) – Entity-name → index mapping.
relation_idxs (dict) – Relation-name → index mapping.
form (str) – 'EntityPrediction'.
neg_ratio (int) – Number of negative samples per positive target.
label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).
- train_data = None
- train_target = None
- neg_ratio = None
- num_entities
- label_smoothing_rate
- collate_fn = None
- max_num_of_classes
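The sub-sampling idea can be sketched for a single (head, relation) pair (an illustration of the technique, not the dicee code: keep all positive tails and draw neg_ratio entities that are not positives, instead of scoring every entity as KvsAll does):

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, neg_ratio = 10, 3
positives = np.array([1, 4])  # known tails for this (head, relation) pair

# Sample negatives only from entities that are not positive tails.
candidates = np.setdiff1d(np.arange(num_entities), positives)
negatives = rng.choice(candidates, size=neg_ratio, replace=False)

# The target vector now covers len(positives) + neg_ratio entities
# instead of all num_entities.
selected = np.concatenate([positives, negatives])
labels = np.concatenate([np.ones(len(positives)), np.zeros(neg_ratio)])
print(selected.shape, labels.sum())  # (5,) 2.0
```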
- class dicee.dataset_classes.OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)[source]
Bases: torch.utils.data.Dataset

Dataset for the 1-vs-All training strategy (multi-class).

Each sample is a (head, relation) pair with a one-hot target vector whose single active position corresponds to the true tail entity.

- Parameters:
train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.
entity_idxs (dict) – Entity-name → index mapping (used to determine the target dimension).
- train_data
- target_dim
- collate_fn = None
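The contrast with KvsAll is that the target here is one-hot rather than multi-label; a minimal sketch (onevsall_item is a hypothetical helper, not part of the dicee API):

```python
import numpy as np

def onevsall_item(triple, num_entities):
    # Input is the (head, relation) pair; the target is a one-hot
    # vector whose single active index is the true tail entity.
    h, r, t = triple
    y = np.zeros(num_entities)
    y[t] = 1.0
    return np.array([h, r]), y

x, y = onevsall_item((3, 1, 2), num_entities=5)
print(x.tolist(), int(y.argmax()))  # [3, 1] 2
```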
- class dicee.dataset_classes.FixedNegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0, seed: int = None)[source]
Bases: torch.utils.data.Dataset

Pre-computed (fixed) negative sampling dataset.

At construction time every positive triple is paired with one random negative (head- or tail-corrupted) using vectorized operations for efficiency. The pairs are stored so that __getitem__ is a simple lookup. This is useful when you want deterministic negatives across epochs (e.g., for reproducibility or debugging).

- Parameters:
train_set (numpy.ndarray) – (N, 3) integer-indexed triples.
num_entities (int) – Total number of entities.
num_relations (int) – Total number of relations.
neg_sample_ratio (int, optional) – Number of negative samples per positive triple (default 1).
label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).
seed (int, optional) – Random seed (default None).
- neg_sample_ratio = 1
- num_entities
- num_relations
- label_smoothing_rate = 0.0
- collate_fn = None
- seed = None
- train_triples
- length
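The determinism property can be sketched as follows (a NumPy illustration of seeded, vectorized corruption, not the dicee implementation; fixed_negatives is a hypothetical helper):

```python
import numpy as np

def fixed_negatives(train_set: np.ndarray, num_entities: int, seed: int):
    # Vectorized one-shot corruption: a fixed seed makes the negatives
    # identical every time the dataset is (re)built.
    rng = np.random.default_rng(seed)
    negatives = train_set.copy()
    random_entities = rng.integers(0, num_entities, size=len(train_set))
    corrupt_head = rng.random(len(train_set)) < 0.5
    negatives[corrupt_head, 0] = random_entities[corrupt_head]
    negatives[~corrupt_head, 2] = random_entities[~corrupt_head]
    return negatives

pos = np.array([[0, 0, 1], [2, 1, 3], [4, 2, 0]])
# Rebuilding with the same seed yields byte-identical negatives.
assert (fixed_negatives(pos, 10, seed=42) == fixed_negatives(pos, 10, seed=42)).all()
```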
- class dicee.dataset_classes.OnevsSample(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]
Bases: torch.utils.data.Dataset

Dataset for 1-vs-Sample training (dynamic multi-class with negatives).

For every positive triple (h, r, t) the dataset draws neg_sample_ratio random entities as negatives and returns a label vector that marks the true tail and the negatives.

- Parameters:
train_set (numpy.ndarray) – (N, 3) integer-indexed triples.
num_entities (int) – Total number of entities.
num_relations (int) – Total number of relations.
neg_sample_ratio (int) – Number of negative samples per positive.
label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).
- train_data
- num_entities
- num_relations
- neg_sample_ratio = None
- label_smoothing_rate
- collate_fn = None
- class dicee.dataset_classes.TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0, seed: int = None)[source]
Bases: torch.utils.data.Dataset

Dataset for triple prediction with on-the-fly negative sampling.

Each item is a single positive triple; the custom collate_fn generates a batch of mixed positive and negative triples.

- Parameters:
train_set (numpy.ndarray) – (N, 3) integer-indexed triples.
num_entities (int) – Total number of entities.
num_relations (int) – Total number of relations.
neg_sample_ratio (int, optional) – Number of negative samples per positive triple (default 1).
label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).
seed (int, optional) – Random seed (default None).
- label_smoothing_rate
- neg_sample_ratio
- seed = None
- train_set
- length
- num_entities
- num_relations
- class dicee.dataset_classes.LiteralDataset(file_path: str, ent_idx: dict = None, normalization_type: str = 'z-norm', sampling_ratio: float = None, loader_backend: str = 'pandas')[source]
Bases: torch.utils.data.Dataset

Dataset for loading and processing literal data for Literal Embedding models.

Handles loading, normalization, and preparation of (entity, attribute, value) triples. Supports z-score and min-max normalization as well as optional sub-sampling for ablation studies.

- Parameters:
file_path (str) – Path to the training data file (CSV / TSV / RDF).
ent_idx (dict) – Entity-name → index mapping.
normalization_type (str, optional) – 'z-norm', 'min-max', or None (default 'z-norm').
sampling_ratio (float or None, optional) – Fraction of the training set to keep (default None = use all).
loader_backend (str, optional) – 'pandas' or 'rdflib' (default 'pandas').
- train_file_path
- loader_backend = 'pandas'
- normalization_type = 'z-norm'
- normalization_params
- sampling_ratio = None
- entity_to_idx = None
- num_entities
- static load_and_validate_literal_data(file_path: str = None, loader_backend: str = 'pandas') pandas.DataFrame[source]
Load and validate a literal data file.
- Parameters:
file_path (str) – Path to the data file.
loader_backend (str) – 'pandas' or 'rdflib'.
- Returns:
Three-column DataFrame with columns head, attribute, value.
- Return type:
pandas.DataFrame
- static denormalize(preds_norm, attributes, normalization_params) numpy.ndarray[source]
Reverse the normalization applied during training.
- Parameters:
preds_norm (numpy.ndarray) – Normalized predictions.
attributes (list) – Attribute names corresponding to each prediction.
normalization_params (dict) – Parameters stored during training.
- Returns:
Denormalized predictions.
- Return type:
numpy.ndarray
- dicee.dataset_classes.construct_dataset(*, train_set: numpy.ndarray | list, valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None, seed: int = None) torch.utils.data.Dataset[source]
Build the appropriate dataset for the given training configuration.
- Parameters:
train_set (numpy.ndarray or list) – Raw integer-indexed triples.
entity_to_idx (dict) – Name → index mappings.
relation_to_idx (dict) – Name → index mappings.
form_of_labelling (str) – 'EntityPrediction' or 'RelationPrediction'.
scoring_technique (str) – One of 'NegSample', 'FixedNegSample', '1vsAll', '1vsSample', 'KvsAll', 'AllvsAll', 'KvsSample'.
neg_ratio (int) – Negative sample ratio.
label_smoothing_rate (float) – Label smoothing coefficient.
- Return type:
torch.utils.data.Dataset
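A plausible mapping from scoring technique to dataset class, inferred from the class descriptions above (illustrative only: the real construct_dataset also dispatches on byte_pair_encoding and form_of_labelling, which this sketch ignores):

```python
# Hypothetical dispatch table; names are the classes documented above.
DISPATCH = {
    "NegSample": "TriplePredictionDataset",
    "FixedNegSample": "FixedNegSampleDataset",
    "1vsAll": "OnevsAllDataset",
    "1vsSample": "OnevsSample",
    "KvsAll": "KvsAll",
    "AllvsAll": "AllvsAll",
    "KvsSample": "KvsSampleDataset",
}

def pick_dataset(scoring_technique: str) -> str:
    # Fail loudly on an unknown technique rather than silently defaulting.
    if scoring_technique not in DISPATCH:
        raise ValueError(f"Unknown scoring technique: {scoring_technique}")
    return DISPATCH[scoring_technique]

print(pick_dataset("KvsAll"))  # KvsAll
```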