dicee.dataset_classes

Classes

BPE_NegativeSamplingDataset

An abstract class representing a Dataset.

MultiLabelDataset

An abstract class representing a Dataset.

MultiClassClassificationDataset

Dataset for the 1vsALL training strategy

OnevsAllDataset

Dataset for the 1vsALL training strategy

KvsAll

Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.

AllvsAll

Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.

OnevsSample

A custom PyTorch Dataset class for knowledge graph embeddings, which includes both positive and negative sampling.

KvsSampleDataset

Dataset for the KvsSample training strategy

NegSampleDataset

An abstract class representing a Dataset.

TriplePredictionDataset

Triple Dataset

CVDataModule

Create a Dataset for cross validation

Functions

reload_dataset(path, form_of_labelling, ...)

Reload the files from disk to construct the PyTorch dataset

construct_dataset(→ torch.utils.data.Dataset)

Module Contents

dicee.dataset_classes.reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)[source]

Reload the files from disk to construct the PyTorch dataset

dicee.dataset_classes.construct_dataset(*, train_set: numpy.ndarray | list, valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None) torch.utils.data.Dataset[source]
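A minimal usage sketch for this factory function. The concrete Dataset subclass it returns is selected from the classes documented below based on scoring_technique and form_of_labelling; the option strings 'KvsAll' and 'EntityPrediction' used here are assumed values for illustration and may differ across dicee versions.

import numpy as np
from dicee.dataset_classes import construct_dataset

# Toy indexed triples: (head_id, relation_id, tail_id).
train_set = np.array([[0, 0, 1],
                      [1, 0, 2],
                      [2, 1, 0]])

entity_to_idx = {"a": 0, "b": 1, "c": 2}
relation_to_idx = {"p": 0, "q": 1}

dataset = construct_dataset(
    train_set=train_set,
    entity_to_idx=entity_to_idx,
    relation_to_idx=relation_to_idx,
    form_of_labelling="EntityPrediction",   # assumed option value
    scoring_technique="KvsAll",             # assumed option value
    neg_ratio=1,
    label_smoothing_rate=0.0,
)
print(len(dataset), "training data points")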
class dicee.dataset_classes.BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading. This method accepts a list of indices of the samples in a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

train_set
ordered_bpe_entities
num_bpe_entities
neg_ratio
num_datapoints
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch_shaped_bpe_triples: List[Tuple[torch.Tensor, torch.Tensor]])[source]
class dicee.dataset_classes.MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading. This method accepts a list of indices of the samples in a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

train_set
train_indices_target
target_dim
num_datapoints
torch_ordered_shaped_bpe_entities
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)[source]

Bases: torch.utils.data.Dataset

Dataset for the 1vsALL training strategy

Return type:

torch.utils.data.Dataset

train_data
block_size
num_of_data_points
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)[source]

Bases: torch.utils.data.Dataset

Dataset for the 1vsALL training strategy

Return type:

torch.utils.data.Dataset
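The following is a simplified sketch of the 1vsALL labelling idea, not the class's actual code: each triple (h, r, t) becomes an input pair (h, r) and a target vector over all entities with a single 1 at the index of t.

import numpy as np
import torch
from torch.utils.data import Dataset

class OneVsAllSketch(Dataset):
    # Illustrative re-implementation of the 1vsALL labelling (assumed, simplified).
    def __init__(self, train_set_idx: np.ndarray, num_entities: int):
        self.train_data = torch.from_numpy(train_set_idx).long()  # (n, 3) triples
        self.target_dim = num_entities

    def __len__(self):
        return len(self.train_data)

    def __getitem__(self, idx):
        h, r, t = self.train_data[idx]
        x = torch.stack([h, r])            # input: (head, relation)
        y = torch.zeros(self.target_dim)   # target over all entities
        y[t] = 1.0                         # single 1 at the tail index
        return x, y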

train_data
target_dim
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.

Let D denote a dataset for KvsAll training, defined as D := {(x, y)_i}_{i=1}^N, where x = (h, r) is a unique tuple of an entity h in E and a relation r in R that has been seen in the input graph, and y is a multi-label binary vector in [0,1]^{|E|} with

y_i = 1 for every i such that (h, r, E_i) is in the KG.

Note

TODO

train_set_idx : numpy.ndarray

n by 3 array representing n triples

entity_idxs : dictionary

mapping from the string representation of an entity to its integer id

relation_idxs : dictionary

mapping from the string representation of a relation to its integer id

Returns:

self : torch.utils.data.Dataset

>>> a = KvsAll()
>>> a
? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
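A sketch of the label construction described above, written as a standalone helper rather than the class itself (an assumed simplification): triples sharing the same (h, r) tuple are grouped, and their tails form the 1-entries of the multi-label target.

from collections import defaultdict
import numpy as np
import torch

def build_kvsall_targets(train_set_idx: np.ndarray, num_entities: int):
    # Group triples by (head, relation); their tails become the 1-entries of y.
    pair_to_tails = defaultdict(list)
    for h, r, t in train_set_idx:
        pair_to_tails[(int(h), int(r))].append(int(t))

    inputs, targets = [], []
    for (h, r), tails in pair_to_tails.items():
        y = torch.zeros(num_entities)
        y[torch.tensor(tails)] = 1.0   # y_i = 1 for every (h, r, E_i) in the KG
        inputs.append((h, r))
        targets.append(y)
    return torch.LongTensor(inputs), torch.stack(targets)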
train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)[source]

Bases: torch.utils.data.Dataset

Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.

Let D denote a dataset for AllvsAll training, defined as D := {(x, y)_i}_{i=1}^N, where x = (h, r) is any possible tuple of an entity h in E and a relation r in R, hence N = |E| x |R|, and y is a multi-label binary vector in [0,1]^{|E|} with

y_i = 1 for every i such that (h, r, E_i) is in the KG.

Note

AllvsAll extends KvsAll by also including (h, r) tuples that do not occur in the input graph. Hence, it adds data points whose label vectors contain no 1s, only 0s.

train_set_idx : numpy.ndarray

n by 3 array representing n triples

entity_idxs : dictionary

mapping from the string representation of an entity to its integer id

relation_idxs : dictionary

mapping from the string representation of a relation to its integer id

Returns:

self : torch.utils.data.Dataset

>>> a = AllvsAll()
>>> a
? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
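A sketch of how AllvsAll differs from KvsAll, following the note above (assumed simplification): every (h, r) combination is enumerated, so tuples that never occur in the graph receive an all-zero label vector.

from collections import defaultdict
import numpy as np
import torch

def build_allvsall_targets(train_set_idx: np.ndarray, num_entities: int, num_relations: int):
    # Collect observed tails per (head, relation) tuple.
    pair_to_tails = defaultdict(list)
    for h, r, t in train_set_idx:
        pair_to_tails[(int(h), int(r))].append(int(t))

    inputs, targets = [], []
    for h in range(num_entities):
        for r in range(num_relations):
            y = torch.zeros(num_entities)
            tails = pair_to_tails.get((h, r), [])
            if tails:                        # seen tuple: mark its tails
                y[torch.tensor(tails)] = 1.0
            inputs.append((h, r))            # unseen tuples keep an all-zero vector
            targets.append(y)
    return torch.LongTensor(inputs), torch.stack(targets)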
train_data = None
train_target = None
label_smoothing_rate
collate_fn = None
target_dim
store
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.OnevsSample(train_set: numpy.ndarray, num_entities, num_relations, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

A custom PyTorch Dataset class for knowledge graph embeddings, which includes both positive and negative sampling for a given dataset, framed as a multi-class classification problem.

Parameters:
  • train_set (np.ndarray) – A numpy array containing triples of knowledge graph data. Each triple consists of (head_entity, relation, tail_entity).

  • num_entities (int) – The number of unique entities in the knowledge graph.

  • num_relations (int) – The number of unique relations in the knowledge graph.

  • neg_sample_ratio (int, optional) – The number of negative samples to be generated per positive sample. Must be a positive integer and less than num_entities.

  • label_smoothing_rate (float, optional) – A label smoothing rate to apply to the positive and negative labels. Defaults to 0.0.

train_data

The input data converted into a PyTorch tensor.

Type:

torch.Tensor

num_entities

Number of entities in the dataset.

Type:

int

num_relations

Number of relations in the dataset.

Type:

int

neg_sample_ratio

Ratio of negative samples to be drawn for each positive sample.

Type:

int

label_smoothing_rate

The smoothing factor applied to the labels.

Type:

torch.Tensor

collate_fn

A function that can be used to collate data samples into batches (set to None by default).

Type:

function, optional

train_data
num_entities
num_relations
neg_sample_ratio
label_smoothing_rate
collate_fn = None
__len__()[source]

Returns the number of samples in the dataset.

__getitem__(idx)[source]

Retrieves a single data sample from the dataset at the given index.

Parameters:

idx (int) – The index of the sample to retrieve.

Returns:

A tuple consisting of:
  • x (torch.Tensor): The head and relation part of the triple.

  • y_idx (torch.Tensor): The concatenated indices of the true object (tail entity) and the indices of the negative samples.

  • y_vec (torch.Tensor): A vector containing the labels for the positive and negative samples, with label smoothing applied.

Return type:

tuple
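A hedged sketch of the per-item logic suggested by the return values above (a simplified re-implementation, not the class's actual code): for each positive triple, neg_sample_ratio entity indices are drawn as negatives, and smoothed labels are attached. The smoothing formula used here is an assumption.

import torch

def one_vs_sample_item(triple, num_entities: int, neg_sample_ratio: int,
                       label_smoothing_rate: float = 0.0):
    # Illustrative: build (x, y_idx, y_vec) for one positive triple (h, r, t).
    h, r, t = triple
    x = torch.tensor([h, r])
    # Negative tail candidates drawn uniformly at random (collisions with t not handled here).
    negatives = torch.randint(low=0, high=num_entities, size=(neg_sample_ratio,))
    y_idx = torch.cat([torch.tensor([t]), negatives])     # true tail first, then negatives
    y_vec = torch.cat([torch.ones(1), torch.zeros(neg_sample_ratio)])
    # Label smoothing pulls the targets away from hard 0/1 values.
    y_vec = y_vec * (1 - label_smoothing_rate) + label_smoothing_rate / y_vec.numel()
    return x, y_idx, y_vec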

class dicee.dataset_classes.KvsSampleDataset(train_set: numpy.ndarray, num_entities, num_relations, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Dataset for the KvsSample training strategy:

D := {(x, y)_i}_{i=1}^N, where x = (h, r) is a unique tuple of an entity h in E and a relation r in R, and y in [0,1]^{|E|} is a binary label vector with y_i = 1 for every i such that (h, r, E_i) is in the KG.

At each mini-batch construction, y is subsampled, so that |new_y| << |E|; new_y contains all 1s if sum(y) < neg_sample_ratio.
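The subsampling step described above can be pictured with the following sketch (an assumed simplification, not the class's actual code): instead of a full |E|-dimensional label vector, each item carries a small set of entity indices with matching 0/1 labels.

import torch

def kvs_sample_targets(positive_tails, num_entities: int, neg_sample_ratio: int):
    # Illustrative: subsample the multi-label target of one (h, r) tuple.
    pos = torch.tensor(positive_tails)
    if len(pos) < neg_sample_ratio:
        # Few positives: keep them all and fill up with random negatives.
        selected_pos = pos
        neg = torch.randint(0, num_entities, (neg_sample_ratio - len(pos),))
    else:
        # Many positives: keep only a random subset of them.
        selected_pos = pos[torch.randperm(len(pos))[:neg_sample_ratio]]
        neg = torch.randint(0, num_entities, (neg_sample_ratio,))
    y_idx = torch.cat([selected_pos, neg])
    y_vec = torch.cat([torch.ones(len(selected_pos)), torch.zeros(len(neg))])
    return y_idx, y_vec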

train_set_idx

Indexed triples for the training.

entity_idxs

mapping.

relation_idxs

mapping.

form

?

store

?

label_smoothing_rate

?

Returns:

torch.utils.data.Dataset

train_data
num_entities
num_relations
neg_sample_ratio
label_smoothing_rate
collate_fn = None
store
train_target
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.NegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading. This method accepts a list of indices of the samples in a batch and returns the list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

neg_sample_ratio
train_set
length
num_entities
num_relations
__len__()[source]
__getitem__(idx)[source]
class dicee.dataset_classes.TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0)[source]

Bases: torch.utils.data.Dataset

Triple Dataset

D := {(x)_i}_{i=1}^N, where x = (h, r, t) in KG is a triple with a head entity h in E, a relation r in R, and a tail entity t in E.

collate_fn generates negative triples: for every (h, r, t) in G, corrupted triples {(h, r, x), (z, r, t), (h, m, t)} are created, where x and z are sampled entities and m is a sampled relation.

y: labels are represented in torch.float16.
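A sketch of corruption-based negative sampling in a collate function, following the description above (an assumed simplification of the class's collate_fn; relation corruption, i.e. (h, m, t), could be added analogously):

from typing import List
import torch

def collate_with_negatives(batch: List[torch.Tensor], num_entities: int,
                           neg_sample_ratio: int = 1):
    # Illustrative collate_fn: append corrupted copies of each positive triple.
    pos = torch.stack(batch)                        # (b, 3) positive triples
    b = pos.size(0)

    neg = pos.repeat(neg_sample_ratio, 1)           # copies to be corrupted
    corrupt_head = torch.rand(neg.size(0)) < 0.5    # corrupt head or tail per copy
    random_entities = torch.randint(0, num_entities, (neg.size(0),))
    neg[corrupt_head, 0] = random_entities[corrupt_head]
    neg[~corrupt_head, 2] = random_entities[~corrupt_head]

    x = torch.cat([pos, neg], dim=0)
    y = torch.cat([torch.ones(b), torch.zeros(neg.size(0))]).to(torch.float16)
    return x, y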

train_set_idx

Indexed triples for the training.

entity_idxs

mapping.

relation_idxs

mapping.

form

?

store

?

label_smoothing_rate

collate_fn : batch: List[torch.IntTensor]

Returns:

torch.utils.data.Dataset

label_smoothing_rate
neg_sample_ratio
train_set
length
num_entities
num_relations
__len__()[source]
__getitem__(idx)[source]
collate_fn(batch: List[torch.Tensor])[source]
class dicee.dataset_classes.CVDataModule(train_set_idx: numpy.ndarray, num_entities, num_relations, neg_sample_ratio, batch_size, num_workers)[source]

Bases: pytorch_lightning.LightningDataModule

Create a Dataset for cross validation

Return type:

?

train_set_idx
num_entities
num_relations
neg_sample_ratio
batch_size
num_workers
train_dataloader() torch.utils.data.DataLoader[source]

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set pytorch_lightning.trainer.trainer.Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
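As a hedged sketch of what this hook could return given the attributes listed above, assuming it wraps TriplePredictionDataset together with its collate_fn (an illustration under stated assumptions, not the class's actual implementation):

import numpy as np
import torch
from dicee.dataset_classes import TriplePredictionDataset

def make_train_dataloader(train_set_idx: np.ndarray, num_entities: int, num_relations: int,
                          neg_sample_ratio: int, batch_size: int, num_workers: int) -> torch.utils.data.DataLoader:
    # Wrap the indexed triples in the negative-sampling triple dataset.
    train_set = TriplePredictionDataset(train_set_idx,
                                        num_entities=num_entities,
                                        num_relations=num_relations,
                                        neg_sample_ratio=neg_sample_ratio)
    return torch.utils.data.DataLoader(train_set,
                                       batch_size=batch_size,
                                       shuffle=True,
                                       num_workers=num_workers,
                                       collate_fn=train_set.collate_fn)  # builds negatives per batch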

setup(*args, **kwargs)[source]

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Parameters:

stage – either 'fit', 'validate', 'test', or 'predict'

Example:

class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = some_other_state()

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
transfer_batch_to_device(*args, **kwargs)[source]

Override this hook if your DataLoader returns tensors wrapped in a custom data structure.

The data types listed below (and any arbitrary nesting of them) are supported out of the box:

  • torch.Tensor or anything that implements .to(…)

  • list

  • dict

  • tuple

For anything else, you need to define how the data is moved to the target device (CPU, GPU, TPU, …).

Note

This hook should only transfer the data and not modify it, nor should it move the data to any other device than the one passed in as argument (unless you know what you are doing). To check the current state of execution of this hook you can use self.trainer.training/testing/validating/predicting so that you can add different logic as per your requirement.

Parameters:
  • batch – A batch of data that needs to be transferred to a new device.

  • device – The target device as defined in PyTorch.

  • dataloader_idx – The index of the dataloader to which the batch belongs.

Returns:

A reference to the data on the new device.

Example:

def transfer_batch_to_device(self, batch, device, dataloader_idx):
    if isinstance(batch, CustomBatch):
        # move all tensors in your custom data structure to the device
        batch.samples = batch.samples.to(device)
        batch.targets = batch.targets.to(device)
    elif dataloader_idx == 0:
        # skip device transfer for the first dataloader or anything you wish
        pass
    else:
        batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
    return batch

See also

  • move_data_to_device()

  • apply_to_collection()

prepare_data(*args, **kwargs)[source]

Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.

Warning

DO NOT set state to the model (use setup instead) since this is NOT called on every device

Example:

def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()

In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node)

  1. Once per node. This is the default and is only called on LOCAL_RANK=0.

  2. Once in total. Only called on GLOBAL_RANK=0.

Example:

# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False

This is called before requesting the dataloaders:

model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()