dicee.dataset_classes
Classes
- BPE_NegativeSamplingDataset: An abstract class representing a Dataset.
- MultiLabelDataset: An abstract class representing a Dataset.
- MultiClassClassificationDataset: Dataset for the 1vsALL training strategy.
- OnevsAllDataset: Dataset for the 1vsALL training strategy.
- KvsAll: Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.
- AllvsAll: Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.
- OnevsSample: A custom PyTorch Dataset class for knowledge graph embeddings, with positive and negative sampling for a multi-class classification problem.
- KvsSampleDataset: KvsSample Dataset.
- NegSampleDataset: An abstract class representing a Dataset.
- TriplePredictionDataset: Triple Dataset.
- LiteralDataset: Dataset for loading and processing literal data for training the Literal Embedding model.
Functions
- reload_dataset: Reload the files from disk to construct the PyTorch dataset.
- construct_dataset
Module Contents
- dicee.dataset_classes.reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)[source]
Reload the files from disk to construct the PyTorch dataset.
- dicee.dataset_classes.construct_dataset(*, train_set: numpy.ndarray | list, valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None) → torch.utils.data.Dataset[source]
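A minimal, illustrative call of this factory, assuming a tiny indexed triple set; the index maps and the 'EntityPrediction'/'KvsAll' argument values are assumptions chosen for the sketch, not prescribed defaults:

import numpy as np
from dicee.dataset_classes import construct_dataset

# Toy indexed triples: (head_id, relation_id, tail_id)
train = np.array([[0, 0, 1], [1, 0, 2]])
dataset = construct_dataset(
    train_set=train,
    valid_set=None,
    test_set=None,
    ordered_bpe_entities=None,
    train_target_indices=None,
    target_dim=None,
    entity_to_idx={"a": 0, "b": 1, "c": 2},
    relation_to_idx={"p": 0},
    form_of_labelling="EntityPrediction",  # assumed value
    scoring_technique="KvsAll",            # assumed value; selects the KvsAll class
    neg_ratio=0,
    label_smoothing_rate=0.0,
    byte_pair_encoding=False,
    block_size=None,
)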
- class dicee.dataset_classes.BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)[source]
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading; this method accepts a list of batch sample indices and returns a list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- train_set
- ordered_bpe_entities
- num_bpe_entities
- neg_ratio
- num_datapoints
- class dicee.dataset_classes.MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)[source]
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading; this method accepts a list of batch sample indices and returns a list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- train_set
- train_indices_target
- target_dim
- num_datapoints
- torch_ordered_shaped_bpe_entities
- collate_fn = None
- class dicee.dataset_classes.MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)[source]
Bases: torch.utils.data.Dataset
Dataset for the 1vsALL training strategy.
- Parameters:
train_set_idx – Indexed triples for the training.
entity_idxs – mapping.
relation_idxs – mapping.
form – ?
num_workers – int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
- Return type:
torch.utils.data.Dataset
- train_data
- block_size = 8
- num_of_data_points
- collate_fn = None
- class dicee.dataset_classes.OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)[source]
Bases: torch.utils.data.Dataset
Dataset for the 1vsALL training strategy.
- Parameters:
train_set_idx – Indexed triples for the training.
entity_idxs – mapping.
relation_idxs – mapping.
form – ?
num_workers – int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
- Return type:
torch.utils.data.Dataset
- train_data
- target_dim
- collate_fn = None
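To illustrate the 1vsALL strategy OnevsAllDataset implements, here is a minimal independent sketch (not the class's actual implementation) of how one data point maps an (h, r) pair to a one-hot target over all entities:

import numpy as np
import torch

triples = np.array([[0, 0, 1], [2, 0, 1]])  # (h, r, t) index triples
num_entities = 3
h, r, t = triples[0]
x = torch.LongTensor([h, r])       # input: head and relation
y = torch.zeros(num_entities)      # target over all entities
y[t] = 1.0                         # only the true tail is labelled 1
print(x, y)                        # tensor([0, 0]) tensor([0., 1., 0.])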
- class dicee.dataset_classes.KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)[source]
Bases: torch.utils.data.Dataset
Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.
Let D denote a dataset for KvsAll training, defined as D := {(x, y)_i}_{i=1}^N, where x = (h, r) is a unique tuple of an entity h in E and a relation r in R that has been seen in the input graph, and y in [0,1]^{|E|} is a multi-label binary vector with y_i = 1 for all i such that (h, r, E_i) in KG. A minimal sketch of this construction follows the class attributes below.
Note
TODO
- train_set_idx : numpy.ndarray
n by 3 array representing n triples
- entity_idxs : dictionary
string representation of an entity to its integer id
- relation_idxs : dictionary
string representation of a relation to its integer id
- Returns:
self : torch.utils.data.Dataset
>>> a = KvsAll()
>>> a
? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
- train_data = None
- train_target = None
- label_smoothing_rate
- collate_fn = None
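A minimal sketch of the KvsAll label construction described above, as an independent illustration rather than the class internals: group all tails per (h, r) pair and mark them in a multi-label vector.

from collections import defaultdict
import numpy as np
import torch

triples = np.array([[0, 0, 1], [0, 0, 2], [1, 0, 2]])  # (h, r, t)
num_entities = 3

pairs = defaultdict(list)
for h, r, t in triples:
    pairs[(h, r)].append(t)                   # all tails seen for this (h, r)

x = torch.LongTensor(list(pairs.keys()))      # one row per unique (h, r)
y = torch.zeros(len(pairs), num_entities)
for i, tails in enumerate(pairs.values()):
    y[i, tails] = 1.0                         # multi-label binary target
print(y)  # row 0: tails {1, 2} of (0, 0); row 1: tail {2} of (1, 0)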
- class dicee.dataset_classes.AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)[source]
Bases: torch.utils.data.Dataset
Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.
Let D denote a dataset for AllvsAll training, defined as D := {(x, y)_i}_{i=1}^N, where x = (h, r) is any possible unique tuple of an entity h in E and a relation r in R, hence N = |E| x |R|, and y in [0,1]^{|E|} is a multi-label binary vector with y_i = 1 for all i such that (h, r, E_i) in KG. A sketch contrasting this with KvsAll follows the class attributes below.
Note
- AllvsAll extends KvsAll with non-existing (h, r) pairs. Hence, it adds data points whose label vectors contain no 1s, only 0s.
- train_set_idx : numpy.ndarray
n by 3 array representing n triples
- entity_idxs : dictionary
string representation of an entity to its integer id
- relation_idxs : dictionary
string representation of a relation to its integer id
- Returns:
self : torch.utils.data.Dataset
>>> a = AllvsAll()
>>> a
? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
- train_data = None
- train_target = None
- label_smoothing_rate
- collate_fn = None
- target_dim
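To see how AllvsAll differs from KvsAll, an independent sketch that enumerates every (h, r) combination, so pairs never observed in the graph get an all-zero label vector:

import itertools
import torch

num_entities, num_relations = 3, 1
observed = {(0, 0): [1, 2], (1, 0): [2]}   # tails per (h, r), as in KvsAll

xs, ys = [], []
for h, r in itertools.product(range(num_entities), range(num_relations)):
    xs.append((h, r))
    y = torch.zeros(num_entities)
    y[observed.get((h, r), [])] = 1.0      # (2, 0) yields an all-zero row
    ys.append(y)
x, y = torch.LongTensor(xs), torch.stack(ys)
print(x.shape, y.shape)  # torch.Size([3, 2]) torch.Size([3, 3])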
- class dicee.dataset_classes.OnevsSample(train_set: numpy.ndarray, num_entities, num_relations, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)[source]
Bases: torch.utils.data.Dataset
A custom PyTorch Dataset class for knowledge graph embeddings, which includes both positive and negative sampling for a given dataset for a multi-class classification problem.
- Parameters:
train_set (np.ndarray) – A numpy array containing triples of knowledge graph data. Each triple consists of (head_entity, relation, tail_entity).
num_entities (int) – The number of unique entities in the knowledge graph.
num_relations (int) – The number of unique relations in the knowledge graph.
neg_sample_ratio (int, optional) – The number of negative samples to be generated per positive sample. Must be a positive integer and less than num_entities.
label_smoothing_rate (float, optional) – A label smoothing rate to apply to the positive and negative labels. Defaults to 0.0.
- train_data
The input data converted into a PyTorch tensor.
- Type:
torch.Tensor
- num_entities
Number of entities in the dataset.
- Type:
int
- num_relations
Number of relations in the dataset.
- Type:
int
- neg_sample_ratio
Ratio of negative samples to be drawn for each positive sample.
- Type:
int
- label_smoothing_rate
The smoothing factor applied to the labels.
- Type:
torch.Tensor
- collate_fn
A function that can be used to collate data samples into batches (set to None by default).
- Type:
function, optional
- train_data
- num_entities
- num_relations
- neg_sample_ratio = None
- label_smoothing_rate
- collate_fn = None
- __getitem__(idx)[source]
Retrieves a single data sample from the dataset at the given index.
- Parameters:
idx (int) – The index of the sample to retrieve.
- Returns:
- A tuple consisting of:
x (torch.Tensor): The head and relation part of the triple.
y_idx (torch.Tensor): The concatenated indices of the true object (tail entity) and the indices of the negative samples.
y_vec (torch.Tensor): A vector containing the labels for the positive and negative samples, with label smoothing applied.
- Return type:
tuple
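A hedged sketch of how the per-sample output described above can be reproduced independently; the random tail corruption and the smoothing formula are assumptions for illustration, and the class may draw negatives and smooth labels differently:

import numpy as np
import torch

train = np.array([[0, 0, 1]])        # one (h, r, t) triple
num_entities, neg_ratio, smoothing = 5, 3, 0.1
h, r, t = train[0]

x = torch.LongTensor([h, r])
negatives = torch.randint(num_entities, (neg_ratio,))  # random tail corruptions
y_idx = torch.cat([torch.LongTensor([t]), negatives])  # true tail first
y_vec = torch.zeros(neg_ratio + 1)
y_vec[0] = 1.0
y_vec = y_vec * (1 - smoothing) + smoothing / 2        # assumed smoothing form
print(x, y_idx, y_vec)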
- class dicee.dataset_classes.KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)[source]
Bases: torch.utils.data.Dataset
KvsSample Dataset:
D := {(x, y)_i}_{i=1}^N, where x = (h, r) with a unique entity h in E and a relation r in R, and y in [0,1]^{|E|} is a binary label vector with y_i = 1 for all i such that (h, r, E_i) in KG. A rough sketch follows the attribute list below.
- train_set_idx
Indexed triples for the training.
- entity_idxs
mapping.
- relation_idxs
mapping.
- form
?
- store
?
- label_smoothing_rate
?
- Returns:
torch.utils.data.Dataset
- train_data = None
- train_target = None
- neg_ratio = None
- num_entities
- label_smoothing_rate
- collate_fn = None
- max_num_of_classes
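KvsSample keeps the KvsAll grouping but, instead of scoring all |E| entities, pairs each (h, r) with its positives plus a sample of negatives. A rough sketch under that reading; the exact sampling policy, and how neg_ratio and max_num_of_classes interact, are the class's own:

import numpy as np
import torch

num_entities, neg_ratio = 10, 2
positives = np.array([1, 2])                   # tails of some (h, r)
candidates = np.setdiff1d(np.arange(num_entities), positives)
negatives = np.random.choice(candidates, neg_ratio * len(positives), replace=False)
targets = torch.LongTensor(np.concatenate([positives, negatives]))
labels = torch.cat([torch.ones(len(positives)), torch.zeros(len(negatives))])
print(targets, labels)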
- class dicee.dataset_classes.NegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1)[source]
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up batched sample loading; this method accepts a list of batch sample indices and returns a list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
- neg_sample_ratio
- train_triples
- length
- num_entities
- num_relations
- labels
- train_set = []
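The attributes above suggest the usual corruption scheme: each positive triple is paired with copies whose head or tail is replaced by a random entity. A minimal independent sketch of that idea:

import numpy as np

triple = np.array([0, 0, 1])                  # (h, r, t)
num_entities, neg_sample_ratio = 5, 2

negatives = np.tile(triple, (neg_sample_ratio, 1))
corrupt_tail = np.random.rand(neg_sample_ratio) > 0.5
negatives[corrupt_tail, 2] = np.random.randint(num_entities, size=corrupt_tail.sum())
negatives[~corrupt_tail, 0] = np.random.randint(num_entities, size=(~corrupt_tail).sum())
print(negatives)   # corrupted copies; their labels would be 0, the positive's 1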
- class dicee.dataset_classes.TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0)[source]
Bases: torch.utils.data.Dataset
Triple Dataset.
D := {(x)_i}_{i=1}^N, where x = (h, r, t) in KG is a unique triple with an entity h in E and a relation r in R.
collate_fn generates negative triples: for all (h, r, t) in G, create negative triples {(h, r, x), (x, r, t), (h, m, t)} by corrupting the tail, the head, or the relation. A sketch of this collate-time generation follows the attribute list below.
Labels y are represented in torch.float16.
- train_set_idx
Indexed triples for the training.
- entity_idxs
mapping.
- relation_idxs
mapping.
- form
?
- store
?
- label_smoothing_rate
- collate_fn: batch: List[torch.IntTensor]
- Returns:
torch.utils.data.Dataset
- label_smoothing_rate
- neg_sample_ratio
- train_set
- length
- num_entities
- num_relations
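A sketch of the collate-time negative generation described above, as an illustration of the idea rather than the library's exact collate_fn: stack the batch of positives, then append corrupted copies with float16 labels.

import torch

def collate_with_negatives(batch, num_entities=5, neg_ratio=1):
    # batch: list of LongTensors, each a positive (h, r, t) triple
    pos = torch.stack(batch)
    neg = pos.repeat(neg_ratio, 1).clone()
    neg[:, 2] = torch.randint(num_entities, (len(neg),))  # corrupt tails
    x = torch.cat([pos, neg])
    y = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))]).to(torch.float16)
    return x, y

x, y = collate_with_negatives([torch.LongTensor([0, 0, 1]), torch.LongTensor([2, 0, 3])])
print(x.shape, y.dtype)   # torch.Size([4, 3]) torch.float16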
- class dicee.dataset_classes.LiteralDataset(file_path: str, ent_idx: dict = None, normalization_type: str = 'z-norm', sampling_ratio: float = None, loader_backend: str = 'pandas')[source]
Bases: torch.utils.data.Dataset
Dataset for loading and processing literal data for training the Literal Embedding model. This dataset handles the loading, normalization, and preparation of triples for training a literal embedding model.
Extends torch.utils.data.Dataset to support PyTorch dataloaders.
- train_file_path
Path to the training data file.
- Type:
str
- normalization
Type of normalization to apply (‘z-norm’, ‘min-max’, or None).
- Type:
str
- normalization_params
Parameters used for normalization.
- Type:
dict
- sampling_ratio
Fraction of the training set to use for ablations.
- Type:
float
- entity_to_idx
Mapping of entities to their indices.
- Type:
dict
- num_entities
Total number of entities.
- Type:
int
- data_property_to_idx
Mapping of data properties to their indices.
- Type:
dict
- num_data_properties
Total number of data properties.
- Type:
int
- loader_backend
Backend to use for loading data (‘pandas’ or ‘rdflib’).
- Type:
str
- train_file_path
- loader_backend = 'pandas'
- normalization_type = 'z-norm'
- normalization_params
- sampling_ratio = None
- entity_to_idx = None
- num_entities
- static load_and_validate_literal_data(file_path: str = None, loader_backend: str = 'pandas') → pandas.DataFrame[source]
Loads and validates the literal data file.
- Parameters:
file_path (str) – Path to the literal data file.
- Returns:
DataFrame containing the loaded and validated data.
- Return type:
pd.DataFrame
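A sketch of what loading with the pandas backend might look like, assuming a hypothetical tab-separated file of (entity, data property, numeric value) rows; the file name and column layout are assumptions, not the method's actual contract:

import pandas as pd

# Hypothetical literal file: one (entity, data_property, value) row per line.
df = pd.read_csv(
    "literals.tsv", sep="\t", header=None,
    names=["head", "attribute", "value"],   # assumed column layout
)
df["value"] = pd.to_numeric(df["value"], errors="coerce")
df = df.dropna(subset=["value"])            # drop rows with non-numeric literals
print(df.head())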
- static denormalize(preds_norm, attributes, normalization_params) → numpy.ndarray[source]
Denormalizes the predictions based on the normalization type.
- Parameters:
preds_norm (np.ndarray) – Normalized predictions to be denormalized.
attributes (list) – List of attributes corresponding to the predictions.
normalization_params (dict) – Dictionary containing normalization parameters for each attribute.
- Returns:
Denormalized predictions.
- Return type:
np.ndarray
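For the 'z-norm' case, denormalization inverts the per-attribute standardization. A minimal sketch, assuming normalization_params stores a mean and std per attribute (the dictionary layout is an assumption for illustration):

import numpy as np

normalization_params = {"height": {"mean": 170.0, "std": 10.0}}  # assumed layout
preds_norm = np.array([0.5, -1.0])
attributes = ["height", "height"]

# Invert z-normalization: x = x_norm * std + mean, per attribute.
preds = np.array([
    p * normalization_params[a]["std"] + normalization_params[a]["mean"]
    for p, a in zip(preds_norm, attributes)
])
print(preds)   # [175. 160.]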