dicee.dataset_classes

Classes

BPE_NegativeSamplingDataset

Dataset that pairs byte-pair-encoded (BPE) triples with sampled negative entities.

MultiLabelDataset

Dataset that serves a multi-label target vector for each BPE-encoded training point.

MultiClassClassificationDataset

Dataset for the 1vsALL training strategy

OnevsAllDataset

Dataset for the 1vsALL training strategy

KvsAll

Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.

AllvsAll

Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.

OnevsSample

A custom PyTorch Dataset class for knowledge graph embeddings, which includes

KvsSampleDataset

Dataset for KvsSample training: all positive labels for an (h, r) pair plus a subsample of negatives.

NegSampleDataset

Dataset that pairs each training triple with randomly sampled negative triples.

TriplePredictionDataset

Triple dataset whose collate_fn generates negative triples.

LiteralDataset

Dataset for loading and processing literal data for training Literal Embedding model.

Functions

reload_dataset(path, form_of_labelling, ...)

Reload the files from disk to construct the PyTorch dataset

construct_dataset(→ torch.utils.data.Dataset)

Module Contents

dicee.dataset_classes.reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)[source]

Reload the files from disk to construct the PyTorch dataset

dicee.dataset_classes.construct_dataset(*, train_set: numpy.ndarray | list, valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None) torch.utils.data.Dataset[source]
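A minimal usage sketch of construct_dataset with the keyword-only signature above. The toy index mappings and triple array are hypothetical stand-ins, and the form_of_labelling and scoring_technique values are assumed to be accepted options:

import numpy as np
from dicee.dataset_classes import construct_dataset

# Hypothetical toy vocabulary and an n-by-3 array of indexed (h, r, t) triples.
entity_to_idx = {"berlin": 0, "germany": 1, "paris": 2, "france": 3}
relation_to_idx = {"capital_of": 0}
train_set = np.array([[0, 0, 1], [2, 0, 3]])

dataset = construct_dataset(
    train_set=train_set,
    entity_to_idx=entity_to_idx,
    relation_to_idx=relation_to_idx,
    form_of_labelling="EntityPrediction",  # assumed option
    scoring_technique="KvsAll",            # assumed option
    neg_ratio=0,
    label_smoothing_rate=0.0,
)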
class dicee.dataset_classes.BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up loading of batched samples. This method accepts a list of sample indices for a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

train_set
ordered_bpe_entities
num_bpe_entities
neg_ratio
num_datapoints
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch_shaped_bpe_triples: List[Tuple[torch.Tensor, torch.Tensor]])[source]
class dicee.dataset_classes.MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up loading of batched samples. This method accepts a list of sample indices for a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

train_set
train_indices_target
target_dim
num_datapoints
torch_ordered_shaped_bpe_entities
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)[source]

Bases: torch.utils.data.Dataset

Dataset for the 1vsALL training strategy

Parameters:
  • subword_units (numpy.ndarray) – Array of subword-unit indices.

  • block_size (int) – Length of a training block. Defaults to 8.

Return type:

torch.utils.data.Dataset

train_data
block_size = 8
num_of_data_points
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)[source]

Bases: torch.utils.data.Dataset

Dataset for the 1vsALL training strategy

Parameters:
  • train_set_idx (numpy.ndarray) – n by 3 array representing n indexed triples.

  • entity_idxs – Mapping from an entity's string representation to its integer id.

Return type:

torch.utils.data.Dataset

train_data
target_dim
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
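A sketch of the 1vsAll idea described above: each triple (h, r, t) yields the input pair (h, r) and a one-hot target vector over all entities. This is an illustration, not the class's exact implementation:

import numpy as np
import torch

def onevsall_item(triple: np.ndarray, num_entities: int):
    # x is the (head, relation) pair; y is a one-hot vector marking the tail.
    h, r, t = triple
    x = torch.LongTensor([int(h), int(r)])
    y = torch.zeros(num_entities)
    y[int(t)] = 1.0
    return x, y

x, y = onevsall_item(np.array([0, 0, 1]), num_entities=4)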
class dicee.dataset_classes.KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.

Let D denote a dataset for KvsAll training, defined as D := {(x, y)_i}_{i=1}^N, where x = (h, r) is a unique tuple of an entity h in E and a relation r in R that has been seen in the input graph, and y denotes a multi-label binary vector in [0,1]^{|E|}:

y_i = 1 for every i such that (h, r, E_i) in KG.

Note

TODO

Parameters:
  • train_set_idx (numpy.ndarray) – n by 3 array representing n triples.

  • entity_idxs (dictionary) – Mapping from the string representation of an entity to its integer id.

  • relation_idxs (dictionary) – Mapping from the string representation of a relation to its integer id.

Returns:

self : torch.utils.data.Dataset

train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
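A sketch of the KvsAll construction described above: triples are grouped by their (h, r) pair, every observed tail becomes a 1 in the label vector, and optional label smoothing softens the hard 0/1 targets. Illustrative only, not the class's exact code:

from collections import defaultdict
import numpy as np
import torch

def build_kvsall(train_set_idx: np.ndarray, num_entities: int, smoothing: float = 0.0):
    # Group tails by their (head, relation) pair.
    pairs = defaultdict(list)
    for h, r, t in train_set_idx:
        pairs[(int(h), int(r))].append(int(t))
    xs, ys = [], []
    for (h, r), tails in pairs.items():
        y = torch.zeros(num_entities)
        y[torch.LongTensor(tails)] = 1.0
        # One common smoothing scheme: pull targets away from exact 0 and 1.
        y = y * (1 - smoothing) + smoothing / num_entities
        xs.append([h, r])
        ys.append(y)
    return torch.LongTensor(xs), torch.stack(ys)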
class dicee.dataset_classes.AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)[source]

Bases: torch.utils.data.Dataset

Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.

Let D denote a dataset for AllvsAll training, defined as D := {(x, y)_i}_{i=1}^N, where x = (h, r) is any possible tuple of an entity h in E and a relation r in R; hence N = |E| x |R|. y denotes a multi-label binary vector in [0,1]^{|E|}:

y_i = 1 for every i such that (h, r, E_i) in KG.

Note

AllvsAll extends KvsAll with (h, r) pairs that do not occur in the input graph. Hence, it adds data points whose label vectors contain no 1s, only 0s.

Parameters:
  • train_set_idx (numpy.ndarray) – n by 3 array representing n triples.

  • entity_idxs (dictionary) – Mapping from the string representation of an entity to its integer id.

  • relation_idxs (dictionary) – Mapping from the string representation of a relation to its integer id.

Returns:

self : torch.utils.data.Dataset

train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
target_dim
__len__()[source]
__getitem__(idx)[source]
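The AllvsAll extension, sketched under the same assumptions: every (h, r) pair in E x R appears as a data point, so pairs never seen in the graph receive all-zero label vectors:

import itertools
import torch

def extend_to_allvsall(pair_labels: dict, num_entities: int, num_relations: int):
    # pair_labels maps (h, r) to its multi-label vector; add unseen pairs with all-zero labels.
    for h, r in itertools.product(range(num_entities), range(num_relations)):
        if (h, r) not in pair_labels:
            pair_labels[(h, r)] = torch.zeros(num_entities)
    return pair_labels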
class dicee.dataset_classes.OnevsSample(train_set: numpy.ndarray, num_entities, num_relations, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

A custom PyTorch Dataset class for knowledge graph embeddings, which includes both positive and negative sampling for a given dataset, framed as a multi-class classification problem.

Parameters:
  • train_set (np.ndarray) – A numpy array containing triples of knowledge graph data. Each triple consists of (head_entity, relation, tail_entity).

  • num_entities (int) – The number of unique entities in the knowledge graph.

  • num_relations (int) – The number of unique relations in the knowledge graph.

  • neg_sample_ratio (int, optional) – The number of negative samples to be generated per positive sample. Must be a positive integer and less than num_entities.

  • label_smoothing_rate (float, optional) – A label smoothing rate to apply to the positive and negative labels. Defaults to 0.0.

train_data

The input data converted into a PyTorch tensor.

Type:

torch.Tensor

num_entities

Number of entities in the dataset.

Type:

int

num_relations

Number of relations in the dataset.

Type:

int

neg_sample_ratio

Ratio of negative samples to be drawn for each positive sample.

Type:

int

label_smoothing_rate

The smoothing factor applied to the labels.

Type:

torch.Tensor

collate_fn

A function that can be used to collate data samples into batches (set to None by default).

Type:

function, optional

train_data
num_entities
num_relations
neg_sample_ratio = None
label_smoothing_rate
collate_fn = None
__len__()[source]

Returns the number of samples in the dataset.

__getitem__(idx)[source]

Retrieves a single data sample from the dataset at the given index.

Parameters:

idx (int) – The index of the sample to retrieve.

Returns:

A tuple consisting of:
  • x (torch.Tensor): The head and relation part of the triple.

  • y_idx (torch.Tensor): The concatenated indices of the true object (tail entity) and the indices of the negative samples.

  • y_vec (torch.Tensor): A vector containing the labels for the positive and negative samples, with label smoothing applied.

Return type:

tuple
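A sketch of the return tuple documented above, assuming uniform negative sampling over all entities; not the class's exact implementation:

import torch

def onevssample_item(triple, num_entities: int, neg_ratio: int, smoothing: float = 0.0):
    # x = (h, r); y_idx = [true tail, negative entities]; y_vec = smoothed labels.
    h, r, t = (int(i) for i in triple)
    x = torch.LongTensor([h, r])
    negs = torch.randint(low=0, high=num_entities, size=(neg_ratio,))
    y_idx = torch.cat([torch.LongTensor([t]), negs])
    y_vec = torch.cat([torch.ones(1), torch.zeros(neg_ratio)])
    # One common binary smoothing scheme (an assumption, not the library's exact formula).
    y_vec = y_vec * (1 - smoothing) + smoothing / 2
    return x, y_idx, y_vec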

class dicee.dataset_classes.KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

KvsSample Dataset:

D := {(x, y)_i}_{i=1}^N, where x = (h, r) is a unique tuple of an entity h in E and a relation r in R, and y in [0,1]^{|E|} is a binary label vector with y_i = 1 for every i such that (h, r, E_i) in KG.

At each mini-batch construction, y is subsampled into new_y, so that |new_y| << |E|. new_y keeps all 1-labelled entries of y; if sum(y) < neg_ratio, the remaining slots are filled with sampled 0-labelled entities.

Parameters:
  • train_set_idx – Indexed triples for training.

  • entity_idxs – Mapping from an entity's string representation to its integer id.

  • relation_idxs – Mapping from a relation's string representation to its integer id.

  • form – ?

  • store – ?

  • label_smoothing_rate – ?

Return type:

torch.utils.data.Dataset

train_data = None
train_target = None
neg_ratio = None
num_entities
label_smoothing_rate
collate_fn = None
max_num_of_classes
__len__()[source]
__getitem__(idx)[source]
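A sketch of the per-(h, r) label subsampling described above: all positive tails are kept and negatives are drawn so the served class set stays much smaller than |E|. Illustrative only:

import numpy as np
import torch

def kvssample_targets(tails, num_entities: int, neg_ratio: int):
    # Keep every positive tail; sample up to neg_ratio entities that are not positives.
    positives = np.asarray(tails)
    candidates = np.setdiff1d(np.arange(num_entities), positives)
    negatives = np.random.choice(candidates, size=min(neg_ratio, len(candidates)), replace=False)
    y_idx = torch.LongTensor(np.concatenate([positives, negatives]))
    y_vec = torch.cat([torch.ones(len(positives)), torch.zeros(len(negatives))])
    return y_idx, y_vec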
class dicee.dataset_classes.NegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up loading of batched samples. This method accepts a list of sample indices for a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

neg_sample_ratio
train_triples
length
num_entities
num_relations
labels
train_set = []
__len__()[source]
__getitem__(idx)[source]
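A sketch of negative sampling by corruption, the standard recipe for this kind of dataset: replace the head or the tail of a positive triple with a random entity. Illustrative, not the class's exact code:

import torch

def corrupt_triple(triple: torch.LongTensor, num_entities: int) -> torch.LongTensor:
    # Corrupt either the head or the tail, chosen uniformly at random.
    h, r, t = triple.tolist()
    random_entity = int(torch.randint(0, num_entities, (1,)))
    if torch.rand(1).item() < 0.5:
        return torch.LongTensor([random_entity, r, t])  # corrupted head
    return torch.LongTensor([h, r, random_entity])      # corrupted tail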
class dicee.dataset_classes.TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Triple Dataset

D := {(x)_i}_{i=1}^N, where x = (h, r, t) in KG is a triple of a unique entity h in E, a relation r in R, and an entity t in E. collate_fn generates negative triples:

for every (h, r, t) in G, create negative triples {(h, r, x), (z, r, t), (h, m, t)}, where x, z and m are corrupting replacements for the tail, head and relation, respectively.

y: labels are represented in torch.float16.

Parameters:
  • train_set_idx – Indexed triples for training.

  • entity_idxs – Mapping from an entity's string representation to its integer id.

  • relation_idxs – Mapping from a relation's string representation to its integer id.

  • form – ?

  • store – ?

  • label_smoothing_rate – ?

  • collate_fn – batch: List[torch.IntTensor]

Return type:

torch.utils.data.Dataset

label_smoothing_rate
neg_sample_ratio
train_set
length
num_entities
num_relations
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch: List[torch.Tensor])[source]
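A sketch of a collate_fn following the recipe above: stack the positive triples of a batch, append head- or tail-corrupted copies, and return float16 labels. Illustrative, not the class's exact implementation:

from typing import List, Tuple
import torch

def collate_with_negatives(batch: List[torch.Tensor], num_entities: int,
                           neg_ratio: int = 1) -> Tuple[torch.Tensor, torch.Tensor]:
    pos = torch.stack(batch)                 # (B, 3) positive triples
    negs = pos.repeat(neg_ratio, 1).clone()  # copies to corrupt
    corrupt_head = torch.rand(len(negs)) < 0.5
    random_entities = torch.randint(0, num_entities, (len(negs),))
    negs[corrupt_head, 0] = random_entities[corrupt_head]    # corrupt heads
    negs[~corrupt_head, 2] = random_entities[~corrupt_head]  # corrupt tails
    triples = torch.cat([pos, negs])
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(negs))]).to(torch.float16)
    return triples, labels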
class dicee.dataset_classes.LiteralDataset(file_path: str, ent_idx: dict = None, normalization_type: str = 'z-norm', sampling_ratio: float = None, loader_backend: str = 'pandas')[source]

Bases: torch.utils.data.Dataset

Dataset for loading and processing literal data for training Literal Embedding model. This dataset handles the loading, normalization, and preparation of triples for training a literal embedding model.

Extends torch.utils.data.Dataset for supporting PyTorch dataloaders.

train_file_path

Path to the training data file.

Type:

str

normalization

Type of normalization to apply (‘z-norm’, ‘min-max’, or None).

Type:

str

normalization_params

Parameters used for normalization.

Type:

dict

sampling_ratio

Fraction of the training set to use for ablations.

Type:

float

entity_to_idx

Mapping of entities to their indices.

Type:

dict

num_entities

Total number of entities.

Type:

int

data_property_to_idx

Mapping of data properties to their indices.

Type:

dict

num_data_properties

Total number of data properties.

Type:

int

loader_backend

Backend to use for loading data (‘pandas’ or ‘rdflib’).

Type:

str

train_file_path
loader_backend = 'pandas'
normalization_type = 'z-norm'
normalization_params
sampling_ratio = None
entity_to_idx = None
num_entities
__getitem__(index)[source]
__len__()[source]
static load_and_validate_literal_data(file_path: str = None, loader_backend: str = 'pandas') pandas.DataFrame[source]

Loads and validates the literal data file.

Parameters:
  • file_path (str) – Path to the literal data file.

  • loader_backend (str) – Backend to use for loading the data ('pandas' or 'rdflib').

Returns:

DataFrame containing the loaded and validated data.

Return type:

pd.DataFrame

static denormalize(preds_norm, attributes, normalization_params) numpy.ndarray[source]

Denormalizes the predictions based on the normalization type.

Parameters:
  • preds_norm (np.ndarray) – Normalized predictions to be denormalized.

  • attributes (list) – List of attributes corresponding to the predictions.

  • normalization_params (dict) – Dictionary containing normalization parameters for each attribute.

Returns:

Denormalized predictions.

Return type:

np.ndarray
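A sketch of per-attribute denormalization for the two normalization types named above ('z-norm' and 'min-max'). The layout of normalization_params and its keys ('mean', 'std', 'min', 'max') are assumptions for illustration:

import numpy as np

def denormalize_sketch(preds_norm: np.ndarray, attributes: list,
                       normalization_params: dict, normalization_type: str) -> np.ndarray:
    preds = np.empty_like(preds_norm, dtype=float)
    for i, (p, attr) in enumerate(zip(preds_norm, attributes)):
        params = normalization_params[attr]  # assumed: dict keyed by attribute name
        if normalization_type == "z-norm":
            preds[i] = p * params["std"] + params["mean"]
        elif normalization_type == "min-max":
            preds[i] = p * (params["max"] - params["min"]) + params["min"]
        else:  # no normalization was applied
            preds[i] = p
    return preds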