dicee.dataset_classes
Classes
- BPE_NegativeSamplingDataset: An abstract class representing a Dataset.
- MultiLabelDataset: An abstract class representing a Dataset.
- MultiClassClassificationDataset: Dataset for the 1vsALL training strategy.
- OnevsAllDataset: Dataset for the 1vsALL training strategy.
- KvsAll: Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.
- AllvsAll: Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.
- OnevsSample: A custom PyTorch Dataset class for knowledge graph embeddings with positive and negative sampling.
- KvsSampleDataset: KvsSample Dataset.
- NegSampleDataset: An abstract class representing a Dataset.
- TriplePredictionDataset: Triple Dataset.
- CVDataModule: Creates a Dataset for cross validation.
Functions
- reload_dataset: Reload the files from disk to construct the PyTorch dataset.
- construct_dataset: Construct a torch.utils.data.Dataset from the indexed splits.
Module Contents
- dicee.dataset_classes.reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)[source]
Reload the files from disk to construct the PyTorch dataset
- dicee.dataset_classes.construct_dataset(*, train_set: numpy.ndarray | list, valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None) -> torch.utils.data.Dataset [source]
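A minimal usage sketch of construct_dataset, assuming the keyword-only signature above; the toy triples, the index dictionaries, and the particular form_of_labelling and scoring_technique strings ("EntityPrediction", "KvsAll") are illustrative assumptions rather than an exhaustive list of accepted values:

import numpy as np

from dicee.dataset_classes import construct_dataset

# three indexed triples (head_idx, relation_idx, tail_idx)
train_set = np.array([[0, 0, 1],
                      [1, 0, 2],
                      [2, 1, 0]])

dataset = construct_dataset(
    train_set=train_set,
    valid_set=None,
    test_set=None,
    ordered_bpe_entities=None,
    train_target_indices=None,
    target_dim=None,
    entity_to_idx={"a": 0, "b": 1, "c": 2},
    relation_to_idx={"p": 0, "q": 1},
    form_of_labelling="EntityPrediction",   # assumed value for illustration
    scoring_technique="KvsAll",             # assumed value for illustration
    neg_ratio=0,
    label_smoothing_rate=0.0,
    byte_pair_encoding=False,
    block_size=None,
)
print(len(dataset))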
- class dicee.dataset_classes.BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)[source]
Bases:
torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__(), for speedup batched samples loading. This method accepts list of indices of samples of batch and returns list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- train_set
- ordered_bpe_entities
- num_bpe_entities
- neg_ratio
- num_datapoints
- class dicee.dataset_classes.MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)[source]
Bases:
torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__(), for speedup batched samples loading. This method accepts list of indices of samples of batch and returns list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- train_set
- train_indices_target
- target_dim
- num_datapoints
- torch_ordered_shaped_bpe_entities
- collate_fn = None
- class dicee.dataset_classes.MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)[source]
Bases:
torch.utils.data.Dataset
Dataset for the 1vsALL training strategy
- Parameters:
subword_units – numpy.ndarray of byte-pair-encoded subword unit indices used as training data.
block_size – int, number of subword units per data point (default 8).
- Return type:
torch.utils.data.Dataset
- train_data
- block_size = 8
- num_of_data_points
- collate_fn = None
- class dicee.dataset_classes.OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)[source]
Bases:
torch.utils.data.Dataset
Dataset for the 1vsALL training strategy
- Parameters:
train_set_idx – Indexed triples for the training.
entity_idxs – entity to index mapping.
- Return type:
torch.utils.data.Dataset
- train_data
- target_dim
- collate_fn = None
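The 1vsAll strategy turns each triple (h, r, t) into an input pair (h, r) with a target vector over all entities that marks the observed tail. The following self-contained sketch only illustrates that labelling idea and is not OnevsAllDataset's internal code:

import numpy as np
import torch

# hypothetical illustration of 1vsAll labelling, not dicee's actual code:
# one training triple (h, r, t) with 4 entities in total
num_entities = 4
triple = np.array([0, 2, 3])          # (head_idx, relation_idx, tail_idx)

x = torch.LongTensor(triple[:2])      # input: (head_idx, relation_idx)
y = torch.zeros(num_entities)         # target over all entities
y[triple[2]] = 1.0                    # the observed tail gets label 1

print(x, y)                           # tensor([0, 2]) tensor([0., 0., 0., 1.])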
- class dicee.dataset_classes.KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)[source]
Bases:
torch.utils.data.Dataset
- Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.
Let D denote a dataset for KvsAll training, defined as D := {(x, y)_i}_{i=1}^{N}, where x = (h, r) is a unique tuple of an entity h in E and a relation r in R that has been seen in the input graph, and y denotes a multi-label binary vector in [0,1]^{|E|}.
y_i = 1 for all i such that (h, r, E_i) is in the KG.
Note
TODO
- train_set_idx : numpy.ndarray
n by 3 array representing n triples
- entity_idxs : dictionary
string representation of an entity to its integer id
- relation_idxs : dictionary
string representation of a relation to its integer id
- Return type:
torch.utils.data.Dataset
Example:
>>> a = KvsAll()
>>> a
? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
- train_data = None
- train_target = None
- label_smoothing_rate
- collate_fn = None
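KvsAll groups the training triples by their (h, r) pair and marks every entity that occurs as a tail for that pair. A small self-contained sketch of this grouping (illustrative only, not the class's internal implementation):

from collections import defaultdict

import numpy as np
import torch

# toy indexed triples (h, r, t) over entities {0,1,2,3} and relations {0,1}
triples = np.array([[0, 0, 1],
                    [0, 0, 2],   # same (h, r) as the first triple -> multi-label target
                    [3, 1, 0]])
num_entities = 4

# group tail entities by their (head, relation) pair
store = defaultdict(list)
for h, r, t in triples:
    store[(int(h), int(r))].append(int(t))

# one multi-label binary vector in [0, 1]^{|E|} per unique (h, r)
pairs = list(store.keys())
targets = torch.zeros(len(pairs), num_entities)
for i, pair in enumerate(pairs):
    targets[i, store[pair]] = 1.0

print(pairs)    # [(0, 0), (3, 1)]
print(targets)  # [[0., 1., 1., 0.], [1., 0., 0., 0.]]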
- class dicee.dataset_classes.AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)[source]
Bases:
torch.utils.data.Dataset
- Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.
Let D denote a dataset for AllvsAll training, defined as D := {(x, y)_i}_{i=1}^{N}, where x = (h, r) is any possible tuple of an entity h in E and a relation r in R, hence N = |E| x |R|, and y denotes a multi-label binary vector in [0,1]^{|E|}.
y_i = 1 for all i such that (h, r, E_i) is in the KG.
Note
- AllvsAll extends KvsAll with non-existing (h, r) pairs. Hence, it adds data points that are labelled only with 0s and contain no 1s.
- train_set_idx : numpy.ndarray
n by 3 array representing n triples
- entity_idxs : dictionary
string representation of an entity to its integer id
- relation_idxs : dictionary
string representation of a relation to its integer id
- Return type:
torch.utils.data.Dataset
Example:
>>> a = AllvsAll()
>>> a
? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
- train_data = None
- train_target = None
- label_smoothing_rate
- collate_fn = None
- target_dim
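AllvsAll enumerates every possible (h, r) pair, so pairs never seen in the graph contribute all-zero target vectors. A sketch continuing the toy KvsAll example above; the observed dictionary stands in for the hypothetical output of that grouping:

import itertools

import torch

# continuing the toy example: 4 entities, 2 relations, and the KvsAll
# targets for the two observed (h, r) pairs
num_entities, num_relations = 4, 2
observed = {(0, 0): [1, 2], (3, 1): [0]}   # hypothetical KvsAll grouping

# AllvsAll enumerates every possible (h, r) pair: N = |E| * |R| data points
pairs = list(itertools.product(range(num_entities), range(num_relations)))
targets = torch.zeros(len(pairs), num_entities)
for i, pair in enumerate(pairs):
    if pair in observed:
        targets[i, observed[pair]] = 1.0   # seen pairs keep their KvsAll labels
    # unseen pairs keep an all-zero label vector

print(len(pairs))          # 8 == |E| * |R|
print(int(targets.sum()))  # 3 positive labels in total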
- class dicee.dataset_classes.OnevsSample(train_set: numpy.ndarray, num_entities, num_relations, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]
Bases:
torch.utils.data.Dataset
A custom PyTorch Dataset class for knowledge graph embeddings, which includes both positive and negative sampling for a given dataset for a multi-class classification problem.
- Parameters:
train_set (np.ndarray) – A numpy array containing triples of knowledge graph data. Each triple consists of (head_entity, relation, tail_entity).
num_entities (int) – The number of unique entities in the knowledge graph.
num_relations (int) – The number of unique relations in the knowledge graph.
neg_sample_ratio (int, optional) – The number of negative samples to be generated per positive sample. Must be a positive integer and less than num_entities.
label_smoothing_rate (float, optional) – A label smoothing rate to apply to the positive and negative labels. Defaults to 0.0.
- train_data
The input data converted into a PyTorch tensor.
- Type:
torch.Tensor
- num_entities
Number of entities in the dataset.
- Type:
int
- num_relations
Number of relations in the dataset.
- Type:
int
- neg_sample_ratio
Ratio of negative samples to be drawn for each positive sample.
- Type:
int
- label_smoothing_rate
The smoothing factor applied to the labels.
- Type:
torch.Tensor
- collate_fn
A function that can be used to collate data samples into batches (set to None by default).
- Type:
function, optional
- train_data
- num_entities
- num_relations
- neg_sample_ratio = None
- label_smoothing_rate
- collate_fn = None
- __getitem__(idx)[source]
Retrieves a single data sample from the dataset at the given index.
- Parameters:
idx (int) – The index of the sample to retrieve.
- Returns:
- A tuple consisting of:
x (torch.Tensor): The head and relation part of the triple.
y_idx (torch.Tensor): The concatenated indices of the true object (tail entity) and the indices of the negative samples.
y_vec (torch.Tensor): A vector containing the labels for the positive and negative samples, with label smoothing applied.
- Return type:
tuple
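A hedged usage sketch of OnevsSample based on the constructor signature and the __getitem__ description above; the toy triple array and the values chosen for the arguments are illustrative assumptions:

import numpy as np

from dicee.dataset_classes import OnevsSample

# toy indexed triples (head_idx, relation_idx, tail_idx)
train_set = np.array([[0, 0, 1],
                      [1, 0, 2],
                      [2, 1, 0]])

dataset = OnevsSample(train_set=train_set,
                      num_entities=3,
                      num_relations=2,
                      neg_sample_ratio=2,       # must be positive and < num_entities
                      label_smoothing_rate=0.0)

# __getitem__ returns (x, y_idx, y_vec) as documented above
x, y_idx, y_vec = dataset[0]
print(x, y_idx.shape, y_vec.shape)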
- class dicee.dataset_classes.KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)[source]
Bases:
torch.utils.data.Dataset
- KvsSample Dataset:
D := {(x, y)_i}_{i=1}^{N}, where x = (h, r) with a unique entity h in E and a relation r in R, and y in [0,1]^{|E|} is a binary label vector.
y_i = 1 for all i such that (h, r, E_i) is in the KG.
- train_set_idx
Indexed triples for the training.
- entity_idxs
mapping.
- relation_idxs
mapping.
- form
?
- store
?
- label_smoothing_rate
?
- Return type:
torch.utils.data.Dataset
- train_data = None
- train_target = None
- neg_ratio = None
- num_entities
- label_smoothing_rate
- collate_fn = None
- max_num_of_classes
- class dicee.dataset_classes.NegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1)[source]
Bases:
torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__(), for speedup batched samples loading. This method accepts list of indices of samples of batch and returns list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- neg_sample_ratio
- train_set
- length
- num_entities
- num_relations
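NegSampleDataset produces negatives by corrupting positive triples with randomly drawn entities. The following stand-alone sketch only illustrates tail corruption with binary labels and is not the class's internal sampling code:

import numpy as np
import torch

rng = np.random.default_rng(0)
num_entities = 4
positive = torch.LongTensor([0, 1, 2])            # one (h, r, t) triple from the KG
neg_sample_ratio = 2

# start from copies of the positive triple and corrupt their tails
negatives = positive.repeat(neg_sample_ratio, 1)
negatives[:, 2] = torch.from_numpy(
    rng.integers(low=0, high=num_entities, size=neg_sample_ratio))

x = torch.cat([positive.unsqueeze(0), negatives], dim=0)
y = torch.cat([torch.ones(1), torch.zeros(neg_sample_ratio)])  # 1 = true, 0 = corrupted
print(x)
print(y)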
- class dicee.dataset_classes.TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0)[source]
Bases:
torch.utils.data.Dataset
Triple Dataset
- D := {(x)_i}_{i=1}^{N}, where
x = (h, r, t) in KG with a unique entity h in E and a relation r in R.
collate_fn generates the negative triples: for every (h, r, t) in G, create negative triples {(h, r, x), (x, r, t), (h, m, t)}, where x denotes a corrupted entity and m a corrupted relation.
y: labels are represented in torch.float16.
- train_set : numpy.ndarray
Indexed triples for the training.
- num_entities : int
- num_relations : int
- neg_sample_ratio : int
Number of negative triples generated per positive triple.
- label_smoothing_rate
- collate_fn
batch: List[torch.IntTensor]
- Return type:
torch.utils.data.Dataset
- label_smoothing_rate
- neg_sample_ratio
- train_set
- length
- num_entities
- num_relations
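Because the negative triples are generated in collate_fn, the dataset is typically handed to a DataLoader together with dataset.collate_fn. A hedged usage sketch based on the constructor signature above; the toy triples and batch size are illustrative:

import numpy as np
from torch.utils.data import DataLoader

from dicee.dataset_classes import TriplePredictionDataset

# toy indexed triples (head_idx, relation_idx, tail_idx)
train_set = np.array([[0, 0, 1],
                      [1, 0, 2],
                      [2, 1, 0]])

dataset = TriplePredictionDataset(train_set=train_set,
                                  num_entities=3,
                                  num_relations=2,
                                  neg_sample_ratio=1)

# collate_fn appends the corrupted (negative) triples and their float labels
loader = DataLoader(dataset, batch_size=2, collate_fn=dataset.collate_fn)
x, y = next(iter(loader))
print(x.shape, y.shape)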
- class dicee.dataset_classes.CVDataModule(train_set_idx: numpy.ndarray, num_entities, num_relations, neg_sample_ratio, batch_size, num_workers)[source]
Bases:
pytorch_lightning.LightningDataModule
Create a Dataset for cross validation
- Parameters:
train_set_idx – Indexed triples for the training.
num_entities – number of entities.
num_relations – number of relations.
neg_sample_ratio – number of negative triples generated per positive triple.
batch_size – int
num_workers – int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
- Return type:
?
- train_set_idx
- num_entities
- num_relations
- neg_sample_ratio
- batch_size
- num_workers
- train_dataloader() -> torch.utils.data.DataLoader [source]
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- setup(*args, **kwargs)[source]
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:
class LitModel(...):
    def __init__(self):
        self.l1 = None

    def prepare_data(self):
        download_data()
        tokenize()

        # don't do this
        self.something = else

    def setup(self, stage):
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)
- transfer_batch_to_device(*args, **kwargs)[source]
Override this hook if your DataLoader returns tensors wrapped in a custom data structure.
The data types listed below (and any arbitrary nesting of them) are supported out of the box:
- torch.Tensor or anything that implements .to(...)
- list
- dict
- tuple
For anything else, you need to define how the data is moved to the target device (CPU, GPU, TPU, …).
Note
This hook should only transfer the data and not modify it, nor should it move the data to any other device than the one passed in as argument (unless you know what you are doing). To check the current state of execution of this hook you can use self.trainer.training/testing/validating/predicting so that you can add different logic as per your requirement.
- Parameters:
batch – A batch of data that needs to be transferred to a new device.
device – The target device as defined in PyTorch.
dataloader_idx – The index of the dataloader to which the batch belongs.
- Returns:
A reference to the data on the new device.
Example:
def transfer_batch_to_device(self, batch, device, dataloader_idx):
    if isinstance(batch, CustomBatch):
        # move all tensors in your custom data structure to the device
        batch.samples = batch.samples.to(device)
        batch.targets = batch.targets.to(device)
    elif dataloader_idx == 0:
        # skip device transfer for the first dataloader or anything you wish
        pass
    else:
        batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
    return batch
See also
move_data_to_device()
apply_to_collection()
- prepare_data(*args, **kwargs)[source]
Use this to download and prepare data. Downloading and saving data with multiple processes (distributed settings) will result in corrupted data. Lightning ensures this method is called only within a single process, so you can safely add your downloading logic within.
Warning
DO NOT set state to the model (use setup instead) since this is NOT called on every device.
Example:
def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()
In a distributed environment, prepare_data can be called in two ways (using prepare_data_per_node):
- Once per node. This is the default and is only called on LOCAL_RANK=0.
- Once in total. Only called on GLOBAL_RANK=0.
Example:
# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = True


# call on GLOBAL_RANK=0 (great for shared file systems)
class LitDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.prepare_data_per_node = False
This is called before requesting the dataloaders:
model.prepare_data()
initialize_distributed()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()
model.predict_dataloader()
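A hedged usage sketch of CVDataModule based on its constructor signature; the toy triples and the numbers chosen for entities, relations, batch size, and workers are illustrative assumptions:

import numpy as np

from dicee.dataset_classes import CVDataModule

# toy indexed triples (head_idx, relation_idx, tail_idx)
train_set_idx = np.array([[0, 0, 1],
                          [1, 0, 2],
                          [2, 1, 0]])

dm = CVDataModule(train_set_idx=train_set_idx,
                  num_entities=3,
                  num_relations=2,
                  neg_sample_ratio=1,
                  batch_size=2,
                  num_workers=0)

# pull one training batch from the data module
batch = next(iter(dm.train_dataloader()))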