dicee.dataset_classes
=====================

.. py:module:: dicee.dataset_classes


Classes
-------

.. autoapisummary::

   dicee.dataset_classes.BPE_NegativeSamplingDataset
   dicee.dataset_classes.MultiLabelDataset
   dicee.dataset_classes.MultiClassClassificationDataset
   dicee.dataset_classes.OnevsAllDataset
   dicee.dataset_classes.KvsAll
   dicee.dataset_classes.AllvsAll
   dicee.dataset_classes.OnevsSample
   dicee.dataset_classes.KvsSampleDataset
   dicee.dataset_classes.NegSampleDataset
   dicee.dataset_classes.TriplePredictionDataset
   dicee.dataset_classes.CVDataModule


Functions
---------

.. autoapisummary::

   dicee.dataset_classes.reload_dataset
   dicee.dataset_classes.construct_dataset


Module Contents
---------------

.. py:function:: reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)

   Reload the files from disk to construct the PyTorch dataset.


.. py:function:: construct_dataset(*, train_set: Union[numpy.ndarray, list], valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None) -> torch.utils.data.Dataset

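A minimal usage sketch for :py:func:`construct_dataset`. The toy index mappings and the
concrete ``form_of_labelling`` / ``scoring_technique`` strings below are assumptions for
illustration, not values prescribed by this module; the factory is assumed to return the
:class:`torch.utils.data.Dataset` subclass matching the requested scoring technique:

.. code-block:: python

   import numpy as np
   from dicee.dataset_classes import construct_dataset

   # toy indexed triples: (head_id, relation_id, tail_id)
   train_set = np.array([[0, 0, 1], [0, 0, 2], [1, 0, 2]])

   dataset = construct_dataset(
       train_set=train_set,
       entity_to_idx={"a": 0, "b": 1, "c": 2},   # hypothetical mappings
       relation_to_idx={"p": 0},
       form_of_labelling="EntityPrediction",     # assumed label form
       scoring_technique="KvsAll",               # assumed to select the KvsAll dataset below
       neg_ratio=0,
       label_smoothing_rate=0.0,
   )
   print(len(dataset))
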
.. py:class:: BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass it.
   All subclasses should overwrite :meth:`__getitem__`, supporting fetching a data
   sample for a given key. Subclasses could also optionally overwrite :meth:`__len__`,
   which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options of
   :class:`~torch.utils.data.DataLoader`. Subclasses could also optionally implement
   :meth:`__getitems__`, to speed up loading of batched samples. This method accepts a
   list of sample indices for a batch and returns a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index sampler
      that yields integral indices. To make it work with a map-style dataset with
      non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: train_set

   .. py:attribute:: ordered_bpe_entities

   .. py:attribute:: num_bpe_entities

   .. py:attribute:: neg_ratio

   .. py:attribute:: num_datapoints

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)

   .. py:method:: collate_fn(batch_shaped_bpe_triples: List[Tuple[torch.Tensor, torch.Tensor]])


.. py:class:: MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass it.
   All subclasses should overwrite :meth:`__getitem__`, supporting fetching a data
   sample for a given key. Subclasses could also optionally overwrite :meth:`__len__`,
   which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options of
   :class:`~torch.utils.data.DataLoader`. Subclasses could also optionally implement
   :meth:`__getitems__`, to speed up loading of batched samples. This method accepts a
   list of sample indices for a batch and returns a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index sampler
      that yields integral indices. To make it work with a map-style dataset with
      non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: train_set

   .. py:attribute:: train_indices_target

   .. py:attribute:: target_dim

   .. py:attribute:: num_datapoints

   .. py:attribute:: torch_ordered_shaped_bpe_entities

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Dataset for the 1vsALL training strategy.

   :param train_set_idx: Indexed triples for the training.
   :param entity_idxs: mapping.
   :param relation_idxs: mapping.
   :param form: ?
   :param num_workers: int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
   :rtype: torch.utils.data.Dataset

   .. py:attribute:: train_data

   .. py:attribute:: block_size
      :value: 8

   .. py:attribute:: num_of_data_points

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Dataset for the 1vsALL training strategy.

   :param train_set_idx: Indexed triples for the training.
   :param entity_idxs: mapping.
   :param relation_idxs: mapping.
   :param form: ?
   :param num_workers: int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
   :rtype: torch.utils.data.Dataset

   .. py:attribute:: train_data

   .. py:attribute:: target_dim

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.

   Let D denote a dataset for KvsAll training, defined as D := {(x, y)_i}_{i=1}^{N}, where

   - x: (h, r) is a unique tuple of an entity h \in E and a relation r \in R that has been seen in the input graph.
   - y: a multi-label binary vector in [0, 1]^{|E|}, with y_i = 1 for every i s.t. (h, r, E_i) \in KG.

   .. note::
      TODO

   Parameters
   ----------
   train_set_idx : numpy.ndarray
       n by 3 array representing n triples
   entity_idxs : dictionary
       string representation of an entity to its integer id
   relation_idxs : dictionary
       string representation of a relation to its integer id

   Returns
   -------
   self : torch.utils.data.Dataset

   See Also
   --------

   Notes
   -----

   Examples
   --------
   >>> a = KvsAll()
   >>> a
   ? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

   .. py:attribute:: train_data
      :value: None

   .. py:attribute:: train_target
      :value: None

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)

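The label construction behind KvsAll can be sketched as follows. This is a simplified
illustration of the idea, not the class's actual implementation: tail entities are grouped
per (h, r) pair and turned into a |E|-dimensional multi-hot target, optionally smoothed
with the conventional label-smoothing formula (the exact formula used by dicee is an
assumption here):

.. code-block:: python

   from collections import defaultdict

   import numpy as np
   import torch


   def build_kvsall_targets(train_set_idx: np.ndarray, num_entities: int,
                            label_smoothing_rate: float = 0.0):
       """Sketch: one multi-hot label vector per unique (head, relation) pair."""
       pair_to_tails = defaultdict(list)
       for h, r, t in train_set_idx:
           pair_to_tails[(int(h), int(r))].append(int(t))

       pairs = torch.LongTensor(list(pair_to_tails.keys()))       # (N, 2) inputs
       targets = torch.zeros(len(pair_to_tails), num_entities)    # (N, |E|) labels
       for i, tails in enumerate(pair_to_tails.values()):
           targets[i, tails] = 1.0

       if label_smoothing_rate:
           # conventional smoothing: (1 - eps) * y + eps / |E|
           targets = (1.0 - label_smoothing_rate) * targets + label_smoothing_rate / num_entities
       return pairs, targets


   # toy graph with 3 entities and 1 relation
   pairs, y = build_kvsall_targets(np.array([[0, 0, 1], [0, 0, 2], [1, 0, 2]]), num_entities=3)
   print(pairs.shape, y.shape)  # torch.Size([2, 2]) torch.Size([2, 3])
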
.. py:class:: AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.

   Let D denote a dataset for AllvsAll training, defined as D := {(x, y)_i}_{i=1}^{N}, where

   - x: (h, r) is any possible tuple of an entity h \in E and a relation r \in R, hence N = |E| x |R|.
   - y: a multi-label binary vector in [0, 1]^{|E|}, with y_i = 1 for every i s.t. (h, r, E_i) \in KG.

   .. note::
      AllvsAll extends KvsAll with non-existing (h, r) pairs. Hence, it adds data points
      whose label vectors contain only 0s and no 1s.

   Parameters
   ----------
   train_set_idx : numpy.ndarray
       n by 3 array representing n triples
   entity_idxs : dictionary
       string representation of an entity to its integer id
   relation_idxs : dictionary
       string representation of a relation to its integer id

   Returns
   -------
   self : torch.utils.data.Dataset

   See Also
   --------

   Notes
   -----

   Examples
   --------
   >>> a = AllvsAll()
   >>> a
   ? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

   .. py:attribute:: train_data
      :value: None

   .. py:attribute:: train_target
      :value: None

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:attribute:: target_dim

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: OnevsSample(train_set: numpy.ndarray, num_entities, num_relations, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   A custom PyTorch Dataset class for knowledge graph embeddings, which includes both
   positive and negative sampling for a given dataset in a multi-class classification problem.

   :param train_set: A numpy array containing triples of knowledge graph data. Each triple
                     consists of (head_entity, relation, tail_entity).
   :type train_set: np.ndarray
   :param num_entities: The number of unique entities in the knowledge graph.
   :type num_entities: int
   :param num_relations: The number of unique relations in the knowledge graph.
   :type num_relations: int
   :param neg_sample_ratio: The number of negative samples to be generated per positive sample.
                            Must be a positive integer and less than num_entities.
   :type neg_sample_ratio: int, optional
   :param label_smoothing_rate: A label smoothing rate to apply to the positive and negative labels.
                                Defaults to 0.0.
   :type label_smoothing_rate: float, optional

   .. attribute:: train_data

      The input data converted into a PyTorch tensor.

      :type: torch.Tensor

   .. attribute:: num_entities

      Number of entities in the dataset.

      :type: int

   .. attribute:: num_relations

      Number of relations in the dataset.

      :type: int

   .. attribute:: neg_sample_ratio

      Ratio of negative samples to be drawn for each positive sample.

      :type: int

   .. attribute:: label_smoothing_rate

      The smoothing factor applied to the labels.

      :type: torch.Tensor

   .. attribute:: collate_fn

      A function that can be used to collate data samples into batches (set to None by default).

      :type: function, optional

   .. py:attribute:: train_data

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:attribute:: neg_sample_ratio
      :value: None

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

      Returns the number of samples in the dataset.

   .. py:method:: __getitem__(idx)

      Retrieves a single data sample from the dataset at the given index.

      :param idx: The index of the sample to retrieve.
      :type idx: int
      :returns: A tuple consisting of:

                - x (torch.Tensor): The head and relation part of the triple.
                - y_idx (torch.Tensor): The concatenated indices of the true object (tail entity)
                  and the indices of the negative samples.
                - y_vec (torch.Tensor): A vector containing the labels for the positive and
                  negative samples, with label smoothing applied.
      :rtype: tuple

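What :meth:`OnevsSample.__getitem__` returns can be sketched as follows. The function below
is a simplified stand-in, not the class's actual implementation: negatives are drawn
uniformly and accidental hits of the true tail are not filtered, and the smoothing formula
is the conventional one (an assumption):

.. code-block:: python

   import torch


   def one_vs_sample_item(triple: torch.LongTensor, num_entities: int,
                          neg_sample_ratio: int, label_smoothing_rate: float = 0.0):
       """Sketch: one positive tail plus neg_sample_ratio randomly drawn entities."""
       h, r, t = triple
       x = torch.stack([h, r])                              # (head, relation) input
       negatives = torch.randint(0, num_entities, (neg_sample_ratio,))
       y_idx = torch.cat([t.unsqueeze(0), negatives])       # true tail first, then negatives
       y_vec = torch.zeros(neg_sample_ratio + 1)
       y_vec[0] = 1.0                                       # label of the true tail
       if label_smoothing_rate:
           y_vec = (1.0 - label_smoothing_rate) * y_vec + label_smoothing_rate / y_vec.numel()
       return x, y_idx, y_vec


   x, y_idx, y_vec = one_vs_sample_item(torch.LongTensor([0, 0, 1]),
                                        num_entities=100, neg_sample_ratio=5)
   print(x.shape, y_idx.shape, y_vec.shape)  # torch.Size([2]) torch.Size([6]) torch.Size([6])
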
.. py:class:: KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   KvsSample a Dataset:

   D := {(x, y)_i}_{i=1}^{N}, where

   - x: (h, r) is a unique pair of an entity h \in E and a relation r \in R,
   - y \in [0, 1]^{|E|} is a binary label, with y_i = 1 for every i s.t. (h, r, E_i) \in KG.

   At each mini-batch construction, we subsample y, hence |new_y| << |E|;
   new_y contains all 1's if sum(y) < neg_sample_ratio.

   Parameters
   ----------
   train_set_idx
       Indexed triples for the training.
   entity_idxs
       mapping.
   relation_idxs
       mapping.
   form
       ?
   store
       ?
   label_smoothing_rate
       ?

   Returns
   -------
   torch.utils.data.Dataset

   .. py:attribute:: train_data
      :value: None

   .. py:attribute:: train_target
      :value: None

   .. py:attribute:: neg_ratio
      :value: None

   .. py:attribute:: num_entities

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:attribute:: max_num_of_classes

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: NegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1)

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass it.
   All subclasses should overwrite :meth:`__getitem__`, supporting fetching a data
   sample for a given key. Subclasses could also optionally overwrite :meth:`__len__`,
   which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options of
   :class:`~torch.utils.data.DataLoader`. Subclasses could also optionally implement
   :meth:`__getitems__`, to speed up loading of batched samples. This method accepts a
   list of sample indices for a batch and returns a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index sampler
      that yields integral indices. To make it work with a map-style dataset with
      non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: neg_sample_ratio

   .. py:attribute:: train_set

   .. py:attribute:: length

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Triple Dataset:

   D := {(x)_i}_{i=1}^{N}, where

   - x: (h, r, t) \in KG is a triple with an entity h \in E, a relation r \in R and a tail entity t \in E.
   - collate_fn => generates negative triples: for every (h, r, t) \in G, create negative triples
     {(h, r, x), (x, r, t), (h, m, t)} by corrupting the tail, the head and the relation, respectively.
   - y: labels are represented in torch.float16.

   Parameters
   ----------
   train_set_idx
       Indexed triples for the training.
   entity_idxs
       mapping.
   relation_idxs
       mapping.
   form
       ?
   store
       ?
   label_smoothing_rate
   collate_fn: batch: List[torch.IntTensor]

   Returns
   -------
   torch.utils.data.Dataset

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: neg_sample_ratio

   .. py:attribute:: train_set

   .. py:attribute:: length

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)

   .. py:method:: collate_fn(batch: List[torch.Tensor])

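The negative-triple generation described above can be sketched as a standalone collate
function. This is a simplified stand-in for :meth:`TriplePredictionDataset.collate_fn`:
it corrupts only heads and tails, draws corruptions uniformly at random, and does not
filter accidental positives:

.. code-block:: python

   from typing import List

   import torch


   def corrupting_collate_fn(batch: List[torch.Tensor], num_entities: int,
                             neg_sample_ratio: int = 1):
       """Sketch: stack positive triples and append corrupted copies as negatives."""
       pos = torch.stack(batch)                                      # (B, 3) positive triples
       num_neg = (pos.size(0) * neg_sample_ratio,)

       corrupt_head = pos.repeat(neg_sample_ratio, 1)
       corrupt_head[:, 0] = torch.randint(0, num_entities, num_neg)  # replace heads

       corrupt_tail = pos.repeat(neg_sample_ratio, 1)
       corrupt_tail[:, 2] = torch.randint(0, num_entities, num_neg)  # replace tails

       x = torch.cat([pos, corrupt_head, corrupt_tail], dim=0)
       # 1.0 for positives, 0.0 for corrupted triples; the class docstring notes that
       # labels are represented in torch.float16
       y = torch.cat([torch.ones(pos.size(0)),
                      torch.zeros(corrupt_head.size(0) + corrupt_tail.size(0))])
       return x, y


   x, y = corrupting_collate_fn([torch.LongTensor([0, 0, 1]), torch.LongTensor([1, 0, 2])],
                                num_entities=10, neg_sample_ratio=2)
   print(x.shape, y.shape)  # torch.Size([10, 3]) torch.Size([10])
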
.. py:class:: CVDataModule(train_set_idx: numpy.ndarray, num_entities, num_relations, neg_sample_ratio, batch_size, num_workers)

   Bases: :py:obj:`pytorch_lightning.LightningDataModule`

   Create a Dataset for cross validation.

   :param train_set_idx: Indexed triples for the training.
   :param num_entities: entity to index mapping.
   :param num_relations: relation to index mapping.
   :param batch_size: int
   :param form: ?
   :param num_workers: int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
   :rtype: ?

   .. py:attribute:: train_set_idx

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:attribute:: neg_sample_ratio

   .. py:attribute:: batch_size

   .. py:attribute:: num_workers

   .. py:method:: train_dataloader() -> torch.utils.data.DataLoader

      An iterable or collection of iterables specifying training samples.

      For more information about multiple dataloaders, see this :ref:`section `.

      The dataloader you return will not be reloaded unless you set
      :paramref:`~pytorch_lightning.trainer.trainer.Trainer.reload_dataloaders_every_n_epochs`
      to a positive integer.

      For data processing use the following pattern:

      - download in :meth:`prepare_data`
      - process and split in :meth:`setup`

      However, the above are only necessary for distributed processing.

      .. warning:: do not assign state in prepare_data

      - :meth:`~pytorch_lightning.trainer.trainer.Trainer.fit`
      - :meth:`prepare_data`
      - :meth:`setup`

      .. note::
         Lightning tries to add the correct sampler for distributed and arbitrary hardware.
         There is no need to set it yourself.

   .. py:method:: setup(*args, **kwargs)

      Called at the beginning of fit (train + validate), validate, test, or predict. This is a
      good hook when you need to build models dynamically or adjust something about them.
      This hook is called on every process when using DDP.

      :param stage: either ``'fit'``, ``'validate'``, ``'test'``, or ``'predict'``

      Example::

          class LitModel(...):
              def __init__(self):
                  self.l1 = None

              def prepare_data(self):
                  download_data()
                  tokenize()

                  # don't do this
                  self.something = else

              def setup(self, stage):
                  data = load_data(...)
                  self.l1 = nn.Linear(28, data.num_classes)

   .. py:method:: transfer_batch_to_device(*args, **kwargs)

      Override this hook if your :class:`~torch.utils.data.DataLoader` returns tensors wrapped
      in a custom data structure.

      The data types listed below (and any arbitrary nesting of them) are supported out of the box:

      - :class:`torch.Tensor` or anything that implements `.to(...)`
      - :class:`list`
      - :class:`dict`
      - :class:`tuple`

      For anything else, you need to define how the data is moved to the target device
      (CPU, GPU, TPU, ...).

      .. note::
         This hook should only transfer the data and not modify it, nor should it move the data
         to any other device than the one passed in as argument (unless you know what you are
         doing). To check the current state of execution of this hook you can use
         ``self.trainer.training/testing/validating/predicting`` so that you can add different
         logic as per your requirement.

      :param batch: A batch of data that needs to be transferred to a new device.
      :param device: The target device as defined in PyTorch.
      :param dataloader_idx: The index of the dataloader to which the batch belongs.

      :returns: A reference to the data on the new device.

      Example::

          def transfer_batch_to_device(self, batch, device, dataloader_idx):
              if isinstance(batch, CustomBatch):
                  # move all tensors in your custom data structure to the device
                  batch.samples = batch.samples.to(device)
                  batch.targets = batch.targets.to(device)
              elif dataloader_idx == 0:
                  # skip device transfer for the first dataloader or anything you wish
                  pass
              else:
                  batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
              return batch

      .. seealso::
         - :meth:`move_data_to_device`
         - :meth:`apply_to_collection`

   .. py:method:: prepare_data(*args, **kwargs)

      Use this to download and prepare data. Downloading and saving data with multiple processes
      (distributed settings) will result in corrupted data. Lightning ensures this method is
      called only within a single process, so you can safely add your downloading logic within.

      .. warning:: DO NOT set state to the model (use ``setup`` instead) since this is NOT called on every device

      Example::

          def prepare_data(self):
              # good
              download_data()
              tokenize()
              etc()

              # bad
              self.split = data_split
              self.some_state = some_other_state()

      In a distributed environment, ``prepare_data`` can be called in two ways
      (using :ref:`prepare_data_per_node`)

      1. Once per node. This is the default and is only called on LOCAL_RANK=0.
      2. Once in total. Only called on GLOBAL_RANK=0.

      Example::

          # DEFAULT
          # called once per node on LOCAL_RANK=0 of that node
          class LitDataModule(LightningDataModule):
              def __init__(self):
                  super().__init__()
                  self.prepare_data_per_node = True


          # call on GLOBAL_RANK=0 (great for shared file systems)
          class LitDataModule(LightningDataModule):
              def __init__(self):
                  super().__init__()
                  self.prepare_data_per_node = False

      This is called before requesting the dataloaders:

      .. code-block:: python

          model.prepare_data()
          initialize_distributed()
          model.setup(stage)
          model.train_dataloader()
          model.val_dataloader()
          model.test_dataloader()
          model.predict_dataloader()
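A minimal usage sketch for :py:class:`CVDataModule`. The toy triples and the concrete
keyword-argument values below are illustrative assumptions that only mirror the constructor
signature shown above; whether the returned dataloader can be consumed directly outside a
Lightning ``Trainer`` is also an assumption of this sketch:

.. code-block:: python

   import numpy as np
   from dicee.dataset_classes import CVDataModule

   # toy indexed triples: (head_id, relation_id, tail_id)
   train_set_idx = np.array([[0, 0, 1], [1, 0, 2], [2, 0, 0]])

   data_module = CVDataModule(train_set_idx=train_set_idx,
                              num_entities=3,
                              num_relations=1,
                              neg_sample_ratio=1,
                              batch_size=2,
                              num_workers=0)

   for batch in data_module.train_dataloader():
       ...  # feed each batch to a knowledge graph embedding model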