dicee.dataset_classes

Dataset classes for knowledge graph embedding training.

This package groups the various torch.utils.data.Dataset implementations used throughout DICE Embeddings into thematic sub-modules:

  • _bpe – Byte-pair-encoding related datasets.

  • _negative_sampling – Negative sampling based datasets.

  • _label_based – Multi-label / multi-class scoring datasets.

  • _literal – Literal (numeric) embedding dataset.

  • _factory – construct_dataset / reload_dataset helpers.

All public names are re-exported here so that existing import statements of the form from dicee.dataset_classes import ... continue to work unchanged.

Classes

BPE_NegativeSamplingDataset

Dataset for negative sampling with byte-pair encoded triples.

MultiClassClassificationDataset

Dataset for autoregressive multi-class classification on sub-word units.

MultiLabelDataset

Multi-label dataset for BPE-based KvsAll / AllvsAll training.

AllvsAll

Dataset for AllvsAll training (multi-label, exhaustive).

KvsAll

Dataset for KvsAll training (multi-label).

KvsSampleDataset

Dataset for KvsSample training (dynamic multi-label).

OnevsAllDataset

Dataset for the 1-vs-All training strategy (multi-class).

FixedNegSampleDataset

Pre-computed (fixed) negative sampling dataset.

OnevsSample

Dataset for 1-vs-Sample training (dynamic multi-class with negatives).

TriplePredictionDataset

Dataset for triple prediction with on-the-fly negative sampling.

LiteralDataset

Dataset for loading and processing literal data for Literal Embedding models.

Functions

construct_dataset(...) → torch.utils.data.Dataset

Build the appropriate dataset for the given training configuration.

reload_dataset(path, form_of_labelling, ...)

Reload training data from disk and construct a PyTorch dataset.

Package Contents

class dicee.dataset_classes.BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)[source]

Bases: torch.utils.data.Dataset

Dataset for negative sampling with byte-pair encoded triples.

Each sample is a BPE-encoded triple. The custom collate_fn constructs negatives by corrupting head or tail entities with random BPE entities.

Parameters:
  • train_set (torch.LongTensor) – Integer-encoded triples of shape (N, 3).

  • ordered_shaped_bpe_entities (torch.LongTensor) – All BPE entity representations, ordered by entity index.

  • neg_ratio (int) – Number of negative samples per positive triple.

train_set
ordered_bpe_entities
num_bpe_entities
neg_ratio
num_datapoints
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch_shaped_bpe_triples: List[Tuple[torch.Tensor, torch.Tensor]])[source]
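
The head/tail corruption performed by collate_fn can be illustrated with a minimal, self-contained sketch. The function name corrupt_triples and the flat-integer triple representation are hypothetical simplifications, not the dicee implementation:

```python
import random

def corrupt_triples(triples, entity_pool, neg_ratio, rng=random.Random(0)):
    """For each positive (h, r, t), emit neg_ratio corrupted copies in which
    either the head or the tail is replaced by a random entity from the pool."""
    negatives = []
    for h, r, t in triples:
        for _ in range(neg_ratio):
            e = rng.choice(entity_pool)
            if rng.random() < 0.5:
                negatives.append((e, r, t))  # head corruption
            else:
                negatives.append((h, r, e))  # tail corruption
    return negatives
```

In the actual dataset the entities are BPE token-id sequences rather than single integers, but the corruption idea is analogous.
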
class dicee.dataset_classes.MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)[source]

Bases: torch.utils.data.Dataset

Dataset for autoregressive multi-class classification on sub-word units.

Splits a flat sequence of sub-word token ids into overlapping windows of size block_size for next-token prediction.

Parameters:
  • subword_units (numpy.ndarray) – 1-D array of sub-word token ids.

  • block_size (int, optional) – Context window length (default 8).

train_data
block_size = 8
num_of_data_points
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
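
The overlapping-window scheme for next-token prediction can be sketched as follows. make_windows is a hypothetical helper for illustration, not part of dicee:

```python
import numpy as np

def make_windows(token_ids, block_size):
    """Split a flat token-id sequence into overlapping (input, target) pairs:
    inputs are windows of length block_size, targets are shifted by one token."""
    xs, ys = [], []
    for i in range(len(token_ids) - block_size):
        xs.append(token_ids[i : i + block_size])
        ys.append(token_ids[i + 1 : i + 1 + block_size])
    return np.array(xs), np.array(ys)
```
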
class dicee.dataset_classes.MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)[source]

Bases: torch.utils.data.Dataset

Multi-label dataset for BPE-based KvsAll / AllvsAll training.

Each sample is a BPE-encoded (head, relation) pair together with a binary multi-label target vector over all entities.

Parameters:
  • train_set (torch.LongTensor) – BPE-encoded input pairs of shape (N, 2, token_length).

  • train_indices_target (torch.LongTensor) – Per-sample lists of positive target entity indices.

  • target_dim (int) – Dimensionality of the target vector (number of entities).

  • torch_ordered_shaped_bpe_entities (torch.LongTensor) – Ordered BPE entity representations.

train_set
train_indices_target
target_dim
num_datapoints
torch_ordered_shaped_bpe_entities
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for AllvsAll training (multi-label, exhaustive).

Extends the KvsAll idea: every possible (entity, relation) combination is included — not just those observed in the KG. Pairs without any known tail entities receive an all-zeros label vector.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping.

  • relation_idxs (dict) – Relation-name → index mapping.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
target_dim
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for KvsAll training (multi-label).

D := {(x, y)_i}_{i=1}^{N} where
  • x = (h, r) is a unique (entity, relation) pair observed in the KG,

  • y ∈ [0, 1]^{|E|} is a multi-label vector with y_j = 1 iff (h, r, e_j) ∈ KG.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping.

  • relation_idxs (dict) – Relation-name → index mapping.

  • form (str) – 'EntityPrediction' or 'RelationPrediction'.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
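
The KvsAll construction of D — unique (h, r) pairs with multi-hot label vectors over all entities — can be sketched as below. kvsall_targets is a hypothetical helper, not the dicee code:

```python
from collections import defaultdict
import numpy as np

def kvsall_targets(triples, num_entities):
    """Group triples by (head, relation) and build a multi-hot label
    vector over all entities for each unique pair."""
    pair_to_tails = defaultdict(set)
    for h, r, t in triples:
        pair_to_tails[(h, r)].add(t)
    pairs = sorted(pair_to_tails)
    labels = np.zeros((len(pairs), num_entities), dtype=np.float32)
    for i, pair in enumerate(pairs):
        labels[i, list(pair_to_tails[pair])] = 1.0
    return np.array(pairs), labels
```
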
class dicee.dataset_classes.KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for KvsSample training (dynamic multi-label).

Like KvsAll but sub-samples the target vector at each access to keep mini-batch sizes manageable when the entity set is large.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping.

  • relation_idxs (dict) – Relation-name → index mapping.

  • form (str) – 'EntityPrediction'.

  • neg_ratio (int) – Number of negative samples per positive target.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data = None
train_target = None
neg_ratio = None
num_entities
label_smoothing_rate
collate_fn = None
max_num_of_classes
__len__()[source]
__getitem__(idx)[source]
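
The per-access sub-sampling idea — keep all known positives for a (head, relation) pair but score only a sample of the remaining entities as negatives — can be sketched as follows (hypothetical helper, not the dicee implementation):

```python
import numpy as np

def kvs_sample(positive_tails, num_entities, neg_ratio, rng):
    """Given the known tails for one (head, relation) pair, draw
    neg_ratio negatives per positive from the remaining entities."""
    positives = np.asarray(positive_tails)
    candidates = np.setdiff1d(np.arange(num_entities), positives)
    negatives = rng.choice(candidates, size=len(positives) * neg_ratio, replace=True)
    selected = np.concatenate([positives, negatives])
    labels = np.zeros(len(selected), dtype=np.float32)
    labels[: len(positives)] = 1.0  # positives first, then sampled negatives
    return selected, labels
```
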
class dicee.dataset_classes.OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)[source]

Bases: torch.utils.data.Dataset

Dataset for the 1-vs-All training strategy (multi-class).

Each sample is a (head, relation) pair with a one-hot target vector whose single active position corresponds to the true tail entity.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping (used to determine the target dimension).

train_data
target_dim
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.FixedNegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0, seed: int = None)[source]

Bases: torch.utils.data.Dataset

Pre-computed (fixed) negative sampling dataset.

At construction time every positive triple is paired with one random negative (head- or tail-corrupted) using vectorized operations for efficiency. The pairs are stored so that __getitem__ is a simple lookup.

This is useful when you want deterministic negatives across epochs (e.g., for reproducibility or debugging).

Parameters:
  • train_set (numpy.ndarray) – (N, 3) integer-indexed triples.

  • num_entities (int) – Total number of entities.

  • num_relations (int) – Total number of relations.

  • neg_sample_ratio (int, optional) – Number of negative samples per positive triple (default 1).

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

neg_sample_ratio = 1
num_entities
num_relations
label_smoothing_rate = 0.0
collate_fn = None
seed = None
train_triples
length
__len__() → int[source]
__getitem__(idx: int) → Tuple[torch.Tensor, torch.Tensor][source]
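
The vectorized construction-time corruption can be sketched as below. fixed_negatives is a hypothetical, simplified stand-in (one negative per positive) for the dataset's internal logic:

```python
import numpy as np

def fixed_negatives(triples, num_entities, seed=None):
    """Vectorized one-negative-per-positive corruption: flip a coin per
    triple and replace either the head or the tail with a random entity.
    A fixed seed makes the negatives deterministic across epochs."""
    rng = np.random.default_rng(seed)
    neg = triples.copy()
    corrupt_head = rng.random(len(triples)) < 0.5
    random_entities = rng.integers(0, num_entities, size=len(triples))
    neg[corrupt_head, 0] = random_entities[corrupt_head]
    neg[~corrupt_head, 2] = random_entities[~corrupt_head]
    return neg
```
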
class dicee.dataset_classes.OnevsSample(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for 1-vs-Sample training (dynamic multi-class with negatives).

For every positive triple (h, r, t) the dataset draws neg_sample_ratio random entities as negatives and returns a label vector that marks the true tail and the negatives.

Parameters:
  • train_set (numpy.ndarray) – (N, 3) integer-indexed triples.

  • num_entities (int) – Total number of entities.

  • num_relations (int) – Total number of relations.

  • neg_sample_ratio (int) – Number of negative samples per positive.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data
num_entities
num_relations
neg_sample_ratio = None
label_smoothing_rate
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0, seed: int = None)[source]

Bases: torch.utils.data.Dataset

Dataset for triple prediction with on-the-fly negative sampling.

Each item is a single positive triple; the custom collate_fn generates a batch of mixed positive and negative triples.

Parameters:
  • train_set (numpy.ndarray) – (N, 3) integer-indexed triples.

  • num_entities (int) – Total number of entities.

  • num_relations (int) – Total number of relations.

  • neg_sample_ratio (int, optional) – Number of negative samples per positive triple (default 1).

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

label_smoothing_rate
neg_sample_ratio
seed = None
train_set
length
num_entities
num_relations
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch: List[torch.Tensor])[source]
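
The on-the-fly batch construction performed by collate_fn can be sketched as below (a simplified illustration with a hypothetical helper name, not the dicee code):

```python
import numpy as np

def collate_with_negatives(batch, num_entities, neg_ratio, rng):
    """Stack the positive triples of a batch, append neg_ratio random
    head- or tail-corruptions per positive, and return binary labels."""
    pos = np.stack(batch)
    neg = np.repeat(pos, neg_ratio, axis=0)
    corrupt_head = rng.random(len(neg)) < 0.5
    rand = rng.integers(0, num_entities, size=len(neg))
    neg[corrupt_head, 0] = rand[corrupt_head]
    neg[~corrupt_head, 2] = rand[~corrupt_head]
    triples = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return triples, labels
```
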
class dicee.dataset_classes.LiteralDataset(file_path: str, ent_idx: dict = None, normalization_type: str = 'z-norm', sampling_ratio: float = None, loader_backend: str = 'pandas')[source]

Bases: torch.utils.data.Dataset

Dataset for loading and processing literal data for Literal Embedding models.

Handles loading, normalization, and preparation of (entity, attribute, value) triples. Supports z-score and min-max normalization as well as optional sub-sampling for ablation studies.

Parameters:
  • file_path (str) – Path to the training data file (CSV / TSV / RDF).

  • ent_idx (dict) – Entity-name → index mapping.

  • normalization_type (str, optional) – 'z-norm', 'min-max', or None (default 'z-norm').

  • sampling_ratio (float or None, optional) – Fraction of the training set to keep (default None = use all).

  • loader_backend (str, optional) – 'pandas' or 'rdflib' (default 'pandas').

train_file_path
loader_backend = 'pandas'
normalization_type = 'z-norm'
normalization_params
sampling_ratio = None
entity_to_idx = None
num_entities
__getitem__(index)[source]
__len__()[source]
static load_and_validate_literal_data(file_path: str = None, loader_backend: str = 'pandas') → pandas.DataFrame[source]

Load and validate a literal data file.

Parameters:
  • file_path (str) – Path to the data file.

  • loader_backend (str) – 'pandas' or 'rdflib'.

Returns:

Three-column DataFrame with columns head, attribute, value.

Return type:

pandas.DataFrame

static denormalize(preds_norm, attributes, normalization_params) → numpy.ndarray[source]

Reverse the normalization applied during training.

Parameters:
  • preds_norm (numpy.ndarray) – Normalized predictions.

  • attributes (list) – Attribute names corresponding to each prediction.

  • normalization_params (dict) – Parameters stored during training.

Returns:

Denormalized predictions.

Return type:

numpy.ndarray
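
The z-score round trip can be sketched as below. Note that the real normalization_params are stored per attribute (denormalize takes an attributes argument); this sketch uses a single global mean and standard deviation for brevity, and the helper names are hypothetical:

```python
import numpy as np

def z_normalize(values):
    """Z-score normalization; returns normalized values plus the
    parameters needed to invert the transform later."""
    mean, std = values.mean(), values.std()
    return (values - mean) / std, {"mean": mean, "std": std}

def z_denormalize(preds_norm, params):
    """Invert z-score normalization using the stored parameters."""
    return preds_norm * params["std"] + params["mean"]
```
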

dicee.dataset_classes.construct_dataset(*, train_set: numpy.ndarray | list, valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None, seed: int = None) → torch.utils.data.Dataset[source]

Build the appropriate dataset for the given training configuration.

Parameters:
  • train_set (numpy.ndarray or list) – Raw integer-indexed triples.

  • entity_to_idx (dict) – Name → index mappings.

  • relation_to_idx (dict) – Name → index mappings.

  • form_of_labelling (str) – 'EntityPrediction' or 'RelationPrediction'.

  • scoring_technique (str) – One of 'NegSample', 'FixedNegSample', '1vsAll', '1vsSample', 'KvsAll', 'AllvsAll', 'KvsSample'.

  • neg_ratio (int) – Negative sample ratio.

  • label_smoothing_rate (float) – Label smoothing coefficient.

Return type:

torch.utils.data.Dataset

dicee.dataset_classes.reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)[source]

Reload training data from disk and construct a PyTorch dataset.