dicee.dataset_classes

Dataset classes for knowledge graph embedding training.

This package groups the various torch.utils.data.Dataset implementations used throughout DICE Embeddings into thematic sub-modules:

  • _bpe – Byte-pair-encoding related datasets.

  • _negative_sampling – Negative sampling based datasets.

  • _label_based – Multi-label / multi-class scoring datasets.

  • _literal – Literal (numeric) embedding dataset.

  • _factory – construct_dataset / reload_dataset helpers.

All public names are re-exported here so that existing import statements of the form from dicee.dataset_classes import ... continue to work unchanged.

Classes

BPE_NegativeSamplingDataset

Dataset for negative sampling with byte-pair encoded triples.

MultiClassClassificationDataset

Dataset for autoregressive multi-class classification on sub-word units.

MultiLabelDataset

Multi-label dataset for BPE-based KvsAll / AllvsAll training.

AllvsAll

Dataset for AllvsAll training (multi-label, exhaustive).

KvsAll

Dataset for KvsAll training (multi-label).

KvsSampleDataset

Dataset for KvsSample training (dynamic multi-label).

OnevsAllDataset

Dataset for the 1-vs-All training strategy (multi-class).

FixedNegSampleDataset

Pre-computed (fixed) negative sampling dataset.

OnevsSample

Dataset for 1-vs-Sample training (dynamic multi-class with negatives).

TriplePredictionDataset

Dataset for triple prediction with on-the-fly negative sampling.

LiteralDataset

Dataset for loading and processing literal data for Literal Embedding models.

Functions

construct_dataset(...) → torch.utils.data.Dataset

Build the appropriate dataset for the given training configuration.

reload_dataset(path, form_of_labelling, ...)

Reload training data from disk and construct a PyTorch dataset.

Package Contents

class dicee.dataset_classes.BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)[source]

Bases: torch.utils.data.Dataset

Dataset for negative sampling with byte-pair encoded triples.

Each sample is a BPE-encoded triple. The custom collate_fn constructs negatives by corrupting head or tail entities with random BPE entities.

Parameters:
  • train_set (torch.LongTensor) – Integer-encoded triples of shape (N, 3).

  • ordered_shaped_bpe_entities (torch.LongTensor) – All BPE entity representations, ordered by entity index.

  • neg_ratio (int) – Number of negative samples per positive triple.

train_set
ordered_bpe_entities
num_bpe_entities
neg_ratio
num_datapoints
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch_shaped_bpe_triples: List[Tuple[torch.Tensor, torch.Tensor]])[source]
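
The head/tail corruption performed by collate_fn can be illustrated with a minimal, self-contained sketch. The function name corrupt_triples and the flat-integer triple representation are hypothetical simplifications, not the dicee implementation:

```python
import random

def corrupt_triples(triples, entity_pool, neg_ratio, rng=random.Random(0)):
    """For each positive (h, r, t), emit neg_ratio corrupted copies in which
    either the head or the tail is replaced by a random entity from the pool."""
    negatives = []
    for h, r, t in triples:
        for _ in range(neg_ratio):
            e = rng.choice(entity_pool)
            if rng.random() < 0.5:
                negatives.append((e, r, t))  # head corruption
            else:
                negatives.append((h, r, e))  # tail corruption
    return negatives
```

In the actual dataset the entities are BPE token-id sequences rather than single integers, but the corruption idea is analogous.
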
class dicee.dataset_classes.MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)[source]

Bases: torch.utils.data.Dataset

Dataset for autoregressive multi-class classification on sub-word units.

Splits a flat sequence of sub-word token ids into overlapping windows of size block_size for next-token prediction.

Parameters:
  • subword_units (numpy.ndarray) – 1-D array of sub-word token ids.

  • block_size (int, optional) – Context window length (default 8).

train_data
block_size = 8
num_of_data_points
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
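
The overlapping-window scheme for next-token prediction can be sketched as follows. make_windows is a hypothetical helper for illustration, not part of dicee:

```python
import numpy as np

def make_windows(token_ids, block_size):
    """Split a flat token-id sequence into overlapping (input, target) pairs:
    inputs are windows of length block_size, targets are shifted by one token."""
    xs, ys = [], []
    for i in range(len(token_ids) - block_size):
        xs.append(token_ids[i : i + block_size])
        ys.append(token_ids[i + 1 : i + 1 + block_size])
    return np.array(xs), np.array(ys)
```
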
class dicee.dataset_classes.MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)[source]

Bases: torch.utils.data.Dataset

Multi-label dataset for BPE-based KvsAll / AllvsAll training.

Each sample is a BPE-encoded (head, relation) pair together with a binary multi-label target vector over all entities.

Parameters:
  • train_set (torch.LongTensor) – BPE-encoded input pairs of shape (N, 2, token_length).

  • train_indices_target (torch.LongTensor) – Per-sample lists of positive target entity indices.

  • target_dim (int) – Dimensionality of the target vector (number of entities).

  • torch_ordered_shaped_bpe_entities (torch.LongTensor) – Ordered BPE entity representations.

train_set
train_indices_target
target_dim
num_datapoints
torch_ordered_shaped_bpe_entities
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for AllvsAll training (multi-label, exhaustive).

Extends the KvsAll idea: every possible (entity, relation) combination is included — not just those observed in the KG. Pairs without any known tail entities receive an all-zeros label vector.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping.

  • relation_idxs (dict) – Relation-name → index mapping.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
target_dim
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for KvsAll training (multi-label).

D := {(x, y)_i}_{i=1}^{N} where
  • x = (h, r) is a unique (entity, relation) pair observed in the KG,

  • y ∈ [0, 1]^{|E|} is a multi-label vector with y_j = 1 iff (h, r, e_j) ∈ KG.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping.

  • relation_idxs (dict) – Relation-name → index mapping.

  • form (str) – 'EntityPrediction' or 'RelationPrediction'.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
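
The KvsAll construction of D — unique (h, r) pairs with multi-hot label vectors over all entities — can be sketched as below. kvsall_targets is a hypothetical helper, not the dicee code:

```python
from collections import defaultdict
import numpy as np

def kvsall_targets(triples, num_entities):
    """Group triples by (head, relation) and build a multi-hot label
    vector over all entities for each unique pair."""
    pair_to_tails = defaultdict(set)
    for h, r, t in triples:
        pair_to_tails[(h, r)].add(t)
    pairs = sorted(pair_to_tails)
    labels = np.zeros((len(pairs), num_entities), dtype=np.float32)
    for i, pair in enumerate(pairs):
        labels[i, list(pair_to_tails[pair])] = 1.0
    return np.array(pairs), labels
```
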
class dicee.dataset_classes.KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for KvsSample training (dynamic multi-label).

Like KvsAll but sub-samples the target vector at each access to keep mini-batch sizes manageable when the entity set is large.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping.

  • relation_idxs (dict) – Relation-name → index mapping.

  • form (str) – 'EntityPrediction'.

  • neg_ratio (int) – Number of negative samples per positive target.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data = None
train_target = None
neg_ratio = None
num_entities
label_smoothing_rate
collate_fn = None
max_num_of_classes
__len__()[source]
__getitem__(idx)[source]
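
The per-access sub-sampling idea — keep all known positives for a (head, relation) pair but score only a sample of the remaining entities as negatives — can be sketched as follows (hypothetical helper, not the dicee implementation):

```python
import numpy as np

def kvs_sample(positive_tails, num_entities, neg_ratio, rng):
    """Given the known tails for one (head, relation) pair, draw
    neg_ratio negatives per positive from the remaining entities."""
    positives = np.asarray(positive_tails)
    candidates = np.setdiff1d(np.arange(num_entities), positives)
    negatives = rng.choice(candidates, size=len(positives) * neg_ratio, replace=True)
    selected = np.concatenate([positives, negatives])
    labels = np.zeros(len(selected), dtype=np.float32)
    labels[: len(positives)] = 1.0  # positives first, then sampled negatives
    return selected, labels
```
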
class dicee.dataset_classes.OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)[source]

Bases: torch.utils.data.Dataset

Dataset for the 1-vs-All training strategy (multi-class).

Each sample is a (head, relation) pair with a one-hot target vector whose single active position corresponds to the true tail entity.

Parameters:
  • train_set_idx (numpy.ndarray) – (N, 3) integer-indexed triples.

  • entity_idxs (dict) – Entity-name → index mapping (used to determine the target dimension).

train_data
target_dim
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.FixedNegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0, seed: int = None)[source]

Bases: torch.utils.data.Dataset

Pre-computed (fixed) negative sampling dataset.

At construction time every positive triple is paired with one random negative (head- or tail-corrupted) using vectorized operations for efficiency. The pairs are stored so that __getitem__ is a simple lookup.

This is useful when you want deterministic negatives across epochs (e.g., for reproducibility or debugging).

Parameters:
  • train_set (numpy.ndarray) – (N, 3) integer-indexed triples.

  • num_entities (int) – Total number of entities.

  • num_relations (int) – Total number of relations.

  • neg_sample_ratio (int, optional) – Number of negative samples per positive triple (default 1).

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

neg_sample_ratio = 1
num_entities
num_relations
label_smoothing_rate = 0.0
collate_fn = None
seed = None
train_triples
length
__len__() → int[source]
__getitem__(idx: int) → Tuple[torch.Tensor, torch.Tensor][source]
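
The vectorized construction-time corruption can be sketched as below. fixed_negatives is a hypothetical, simplified stand-in (one negative per positive) for the dataset's internal logic:

```python
import numpy as np

def fixed_negatives(triples, num_entities, seed=None):
    """Vectorized one-negative-per-positive corruption: flip a coin per
    triple and replace either the head or the tail with a random entity.
    A fixed seed makes the negatives deterministic across epochs."""
    rng = np.random.default_rng(seed)
    neg = triples.copy()
    corrupt_head = rng.random(len(triples)) < 0.5
    random_entities = rng.integers(0, num_entities, size=len(triples))
    neg[corrupt_head, 0] = random_entities[corrupt_head]
    neg[~corrupt_head, 2] = random_entities[~corrupt_head]
    return neg
```
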
class dicee.dataset_classes.OnevsSample(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for 1-vs-Sample training (dynamic multi-class with negatives).

For every positive triple (h, r, t) the dataset draws neg_sample_ratio random entities as negatives and returns a label vector that marks the true tail and the negatives.

Parameters:
  • train_set (numpy.ndarray) – (N, 3) integer-indexed triples.

  • num_entities (int) – Total number of entities.

  • num_relations (int) – Total number of relations.

  • neg_sample_ratio (int) – Number of negative samples per positive.

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

train_data
num_entities
num_relations
neg_sample_ratio = None
label_smoothing_rate
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0, seed: int = None)[source]

Bases: torch.utils.data.Dataset

Dataset for triple prediction with on-the-fly negative sampling.

Each item is a single positive triple; the custom collate_fn generates a batch of mixed positive and negative triples.

Parameters:
  • train_set (numpy.ndarray) – (N, 3) integer-indexed triples.

  • num_entities (int) – Total number of entities.

  • num_relations (int) – Total number of relations.

  • neg_sample_ratio (int, optional) – Number of negative samples per positive triple (default 1).

  • label_smoothing_rate (float, optional) – Label smoothing coefficient (default 0.0).

label_smoothing_rate
neg_sample_ratio
seed = None
train_set
length
num_entities
num_relations
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch: List[torch.Tensor])[source]
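
The on-the-fly batch construction performed by collate_fn can be sketched as below (a simplified illustration with a hypothetical helper name, not the dicee code):

```python
import numpy as np

def collate_with_negatives(batch, num_entities, neg_ratio, rng):
    """Stack the positive triples of a batch, append neg_ratio random
    head- or tail-corruptions per positive, and return binary labels."""
    pos = np.stack(batch)
    neg = np.repeat(pos, neg_ratio, axis=0)
    corrupt_head = rng.random(len(neg)) < 0.5
    rand = rng.integers(0, num_entities, size=len(neg))
    neg[corrupt_head, 0] = rand[corrupt_head]
    neg[~corrupt_head, 2] = rand[~corrupt_head]
    triples = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return triples, labels
```
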
class dicee.dataset_classes.LiteralDataset(file_path: str, ent_idx: dict = None, normalization_type: str = 'z-norm', sampling_ratio: float = None, loader_backend: str = 'pandas')[source]

Bases: torch.utils.data.Dataset

Dataset for loading and processing literal data for Literal Embedding models.

Handles loading, normalization, and preparation of (entity, attribute, value) triples. Supports z-score and min-max normalization as well as optional sub-sampling for ablation studies.

Parameters:
  • file_path (str) – Path to the training data file (CSV / TSV / RDF).

  • ent_idx (dict) – Entity-name → index mapping.

  • normalization_type (str, optional) – 'z-norm', 'min-max', or None (default 'z-norm').

  • sampling_ratio (float or None, optional) – Fraction of the training set to keep (default None = use all).

  • loader_backend (str, optional) – 'pandas' or 'rdflib' (default 'pandas').

train_file_path
loader_backend = 'pandas'
normalization_type = 'z-norm'
normalization_params
sampling_ratio = None
entity_to_idx = None
num_entities
__getitem__(index)[source]
__len__()[source]
static load_and_validate_literal_data(file_path: str = None, loader_backend: str = 'pandas') → pandas.DataFrame[source]

Load and validate a literal data file.

Parameters:
  • file_path (str) – Path to the data file.

  • loader_backend (str) – 'pandas' or 'rdflib'.

Returns:

Three-column DataFrame with columns head, attribute, value.

Return type:

pandas.DataFrame

static denormalize(preds_norm, attributes, normalization_params) → numpy.ndarray[source]

Reverse the normalization applied during training.

Parameters:
  • preds_norm (numpy.ndarray) – Normalized predictions.

  • attributes (list) – Attribute names corresponding to each prediction.

  • normalization_params (dict) – Parameters stored during training.

Returns:

Denormalized predictions.

Return type:

numpy.ndarray
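
The z-score round trip can be sketched as below. Note that the real normalization_params are stored per attribute (denormalize takes an attributes argument); this sketch uses a single global mean and standard deviation for brevity, and the helper names are hypothetical:

```python
import numpy as np

def z_normalize(values):
    """Z-score normalization; returns normalized values plus the
    parameters needed to invert the transform later."""
    mean, std = values.mean(), values.std()
    return (values - mean) / std, {"mean": mean, "std": std}

def z_denormalize(preds_norm, params):
    """Invert z-score normalization using the stored parameters."""
    return preds_norm * params["std"] + params["mean"]
```
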

dicee.dataset_classes.construct_dataset(*, train_set: numpy.ndarray | list, valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None, seed: int = None) → torch.utils.data.Dataset[source]

Build the appropriate dataset for the given training configuration.

Parameters:
  • train_set (numpy.ndarray or list) – Raw integer-indexed triples.

  • entity_to_idx (dict) – Name → index mappings.

  • relation_to_idx (dict) – Name → index mappings.

  • form_of_labelling (str) – 'EntityPrediction' or 'RelationPrediction'.

  • scoring_technique (str) – One of 'NegSample', 'FixedNegSample', '1vsAll', '1vsSample', 'KvsAll', 'AllvsAll', 'KvsSample'.

  • neg_ratio (int) – Negative sample ratio.

  • label_smoothing_rate (float) – Label smoothing coefficient.

Return type:

torch.utils.data.Dataset

dicee.dataset_classes.reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)[source]

Reload training data from disk and construct a PyTorch dataset.