dicee.dataset_classes
=====================

.. py:module:: dicee.dataset_classes


Classes
-------

.. autoapisummary::

   dicee.dataset_classes.BPE_NegativeSamplingDataset
   dicee.dataset_classes.MultiLabelDataset
   dicee.dataset_classes.MultiClassClassificationDataset
   dicee.dataset_classes.OnevsAllDataset
   dicee.dataset_classes.KvsAll
   dicee.dataset_classes.AllvsAll
   dicee.dataset_classes.OnevsSample
   dicee.dataset_classes.KvsSampleDataset
   dicee.dataset_classes.NegSampleDataset
   dicee.dataset_classes.TriplePredictionDataset
   dicee.dataset_classes.CVDataModule


Functions
---------

.. autoapisummary::

   dicee.dataset_classes.reload_dataset
   dicee.dataset_classes.construct_dataset


Module Contents
---------------

.. py:function:: reload_dataset(path: str, form_of_labelling, scoring_technique, neg_ratio, label_smoothing_rate)

   Reload the files from disk to construct the PyTorch dataset.


.. py:function:: construct_dataset(*, train_set: Union[numpy.ndarray, list], valid_set=None, test_set=None, ordered_bpe_entities=None, train_target_indices=None, target_dim: int = None, entity_to_idx: dict, relation_to_idx: dict, form_of_labelling: str, scoring_technique: str, neg_ratio: int, label_smoothing_rate: float, byte_pair_encoding=None, block_size: int = None) -> torch.utils.data.Dataset

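A minimal usage sketch for :py:func:`construct_dataset`. The toy index mappings and the
concrete ``form_of_labelling`` / ``scoring_technique`` strings below are assumptions for
illustration, not values prescribed by this module; the factory is assumed to return the
:class:`torch.utils.data.Dataset` subclass matching the requested scoring technique:

.. code-block:: python

   import numpy as np
   from dicee.dataset_classes import construct_dataset

   # toy indexed triples: (head_id, relation_id, tail_id)
   train_set = np.array([[0, 0, 1], [0, 0, 2], [1, 0, 2]])

   dataset = construct_dataset(
       train_set=train_set,
       entity_to_idx={"a": 0, "b": 1, "c": 2},   # hypothetical mappings
       relation_to_idx={"p": 0},
       form_of_labelling="EntityPrediction",     # assumed label form
       scoring_technique="KvsAll",               # assumed to select the KvsAll dataset below
       neg_ratio=0,
       label_smoothing_rate=0.0,
   )
   print(len(dataset))
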
.. py:class:: BPE_NegativeSamplingDataset(train_set: torch.LongTensor, ordered_shaped_bpe_entities: torch.LongTensor, neg_ratio: int)

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass it.
   All subclasses should overwrite :meth:`__getitem__`, supporting fetching a data
   sample for a given key. Subclasses could also optionally overwrite :meth:`__len__`,
   which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options of
   :class:`~torch.utils.data.DataLoader`. Subclasses could also optionally implement
   :meth:`__getitems__`, to speed up loading of batched samples. This method accepts a
   list of sample indices for a batch and returns a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index sampler
      that yields integral indices. To make it work with a map-style dataset with
      non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: train_set

   .. py:attribute:: ordered_bpe_entities

   .. py:attribute:: num_bpe_entities

   .. py:attribute:: neg_ratio

   .. py:attribute:: num_datapoints

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)

   .. py:method:: collate_fn(batch_shaped_bpe_triples: List[Tuple[torch.Tensor, torch.Tensor]])


.. py:class:: MultiLabelDataset(train_set: torch.LongTensor, train_indices_target: torch.LongTensor, target_dim: int, torch_ordered_shaped_bpe_entities: torch.LongTensor)

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass it.
   All subclasses should overwrite :meth:`__getitem__`, supporting fetching a data
   sample for a given key. Subclasses could also optionally overwrite :meth:`__len__`,
   which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options of
   :class:`~torch.utils.data.DataLoader`. Subclasses could also optionally implement
   :meth:`__getitems__`, to speed up loading of batched samples. This method accepts a
   list of sample indices for a batch and returns a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index sampler
      that yields integral indices. To make it work with a map-style dataset with
      non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: train_set

   .. py:attribute:: train_indices_target

   .. py:attribute:: target_dim

   .. py:attribute:: num_datapoints

   .. py:attribute:: torch_ordered_shaped_bpe_entities

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: MultiClassClassificationDataset(subword_units: numpy.ndarray, block_size: int = 8)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Dataset for the 1vsALL training strategy.

   :param train_set_idx: Indexed triples for the training.
   :param entity_idxs: mapping.
   :param relation_idxs: mapping.
   :param form: ?
   :param num_workers: int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
   :rtype: torch.utils.data.Dataset

   .. py:attribute:: train_data

   .. py:attribute:: block_size
      :value: 8

   .. py:attribute:: num_of_data_points

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: OnevsAllDataset(train_set_idx: numpy.ndarray, entity_idxs)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Dataset for the 1vsALL training strategy.

   :param train_set_idx: Indexed triples for the training.
   :param entity_idxs: mapping.
   :param relation_idxs: mapping.
   :param form: ?
   :param num_workers: int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
   :rtype: torch.utils.data.Dataset

   .. py:attribute:: train_data

   .. py:attribute:: target_dim

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: KvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Creates a dataset for KvsAll training by inheriting from torch.utils.data.Dataset.

   Let D denote a dataset for KvsAll training, defined as D := {(x, y)_i}_{i=1}^{N}, where

   - x: (h, r) is a unique tuple of an entity h \in E and a relation r \in R that has been seen in the input graph.
   - y: a multi-label binary vector in [0, 1]^{|E|}, with y_i = 1 for every i s.t. (h, r, E_i) \in KG.

   .. note::
      TODO

   Parameters
   ----------
   train_set_idx : numpy.ndarray
       n by 3 array representing n triples
   entity_idxs : dictionary
       string representation of an entity to its integer id
   relation_idxs : dictionary
       string representation of a relation to its integer id

   Returns
   -------
   self : torch.utils.data.Dataset

   See Also
   --------

   Notes
   -----

   Examples
   --------
   >>> a = KvsAll()
   >>> a
   ? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

   .. py:attribute:: train_data
      :value: None

   .. py:attribute:: train_target
      :value: None

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)

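The label construction behind KvsAll can be sketched as follows. This is a simplified
illustration of the idea, not the class's actual implementation: tail entities are grouped
per (h, r) pair and turned into a |E|-dimensional multi-hot target, optionally smoothed
with the conventional label-smoothing formula (the exact formula used by dicee is an
assumption here):

.. code-block:: python

   from collections import defaultdict

   import numpy as np
   import torch


   def build_kvsall_targets(train_set_idx: np.ndarray, num_entities: int,
                            label_smoothing_rate: float = 0.0):
       """Sketch: one multi-hot label vector per unique (head, relation) pair."""
       pair_to_tails = defaultdict(list)
       for h, r, t in train_set_idx:
           pair_to_tails[(int(h), int(r))].append(int(t))

       pairs = torch.LongTensor(list(pair_to_tails.keys()))       # (N, 2) inputs
       targets = torch.zeros(len(pair_to_tails), num_entities)    # (N, |E|) labels
       for i, tails in enumerate(pair_to_tails.values()):
           targets[i, tails] = 1.0

       if label_smoothing_rate:
           # conventional smoothing: (1 - eps) * y + eps / |E|
           targets = (1.0 - label_smoothing_rate) * targets + label_smoothing_rate / num_entities
       return pairs, targets


   # toy graph with 3 entities and 1 relation
   pairs, y = build_kvsall_targets(np.array([[0, 0, 1], [0, 0, 2], [1, 0, 2]]), num_entities=3)
   print(pairs.shape, y.shape)  # torch.Size([2, 2]) torch.Size([2, 3])
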
.. py:class:: AllvsAll(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, label_smoothing_rate=0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Creates a dataset for AllvsAll training by inheriting from torch.utils.data.Dataset.

   Let D denote a dataset for AllvsAll training, defined as D := {(x, y)_i}_{i=1}^{N}, where

   - x: (h, r) is any possible tuple of an entity h \in E and a relation r \in R, hence N = |E| x |R|.
   - y: a multi-label binary vector in [0, 1]^{|E|}, with y_i = 1 for every i s.t. (h, r, E_i) \in KG.

   .. note::
      AllvsAll extends KvsAll with non-existing (h, r) pairs. Hence, it adds data points
      whose label vectors contain only 0s and no 1s.

   Parameters
   ----------
   train_set_idx : numpy.ndarray
       n by 3 array representing n triples
   entity_idxs : dictionary
       string representation of an entity to its integer id
   relation_idxs : dictionary
       string representation of a relation to its integer id

   Returns
   -------
   self : torch.utils.data.Dataset

   See Also
   --------

   Notes
   -----

   Examples
   --------
   >>> a = AllvsAll()
   >>> a
   ? array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

   .. py:attribute:: train_data
      :value: None

   .. py:attribute:: train_target
      :value: None

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:attribute:: target_dim

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: OnevsSample(train_set: numpy.ndarray, num_entities, num_relations, neg_sample_ratio: int = None, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   A custom PyTorch Dataset class for knowledge graph embeddings, which includes both
   positive and negative sampling for a given dataset in a multi-class classification problem.

   :param train_set: A numpy array containing triples of knowledge graph data. Each triple
                     consists of (head_entity, relation, tail_entity).
   :type train_set: np.ndarray
   :param num_entities: The number of unique entities in the knowledge graph.
   :type num_entities: int
   :param num_relations: The number of unique relations in the knowledge graph.
   :type num_relations: int
   :param neg_sample_ratio: The number of negative samples to be generated per positive sample.
                            Must be a positive integer and less than num_entities.
   :type neg_sample_ratio: int, optional
   :param label_smoothing_rate: A label smoothing rate to apply to the positive and negative labels.
                                Defaults to 0.0.
   :type label_smoothing_rate: float, optional

   .. attribute:: train_data

      The input data converted into a PyTorch tensor.

      :type: torch.Tensor

   .. attribute:: num_entities

      Number of entities in the dataset.

      :type: int

   .. attribute:: num_relations

      Number of relations in the dataset.

      :type: int

   .. attribute:: neg_sample_ratio

      Ratio of negative samples to be drawn for each positive sample.

      :type: int

   .. attribute:: label_smoothing_rate

      The smoothing factor applied to the labels.

      :type: torch.Tensor

   .. attribute:: collate_fn

      A function that can be used to collate data samples into batches (set to None by default).

      :type: function, optional

   .. py:attribute:: train_data

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:attribute:: neg_sample_ratio
      :value: None

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:method:: __len__()

      Returns the number of samples in the dataset.

   .. py:method:: __getitem__(idx)

      Retrieves a single data sample from the dataset at the given index.

      :param idx: The index of the sample to retrieve.
      :type idx: int
      :returns: A tuple consisting of:

                - x (torch.Tensor): The head and relation part of the triple.
                - y_idx (torch.Tensor): The concatenated indices of the true object (tail entity)
                  and the indices of the negative samples.
                - y_vec (torch.Tensor): A vector containing the labels for the positive and
                  negative samples, with label smoothing applied.
      :rtype: tuple

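What :meth:`OnevsSample.__getitem__` returns can be sketched as follows. The function below
is a simplified stand-in, not the class's actual implementation: negatives are drawn
uniformly and accidental hits of the true tail are not filtered, and the smoothing formula
is the conventional one (an assumption):

.. code-block:: python

   import torch


   def one_vs_sample_item(triple: torch.LongTensor, num_entities: int,
                          neg_sample_ratio: int, label_smoothing_rate: float = 0.0):
       """Sketch: one positive tail plus neg_sample_ratio randomly drawn entities."""
       h, r, t = triple
       x = torch.stack([h, r])                              # (head, relation) input
       negatives = torch.randint(0, num_entities, (neg_sample_ratio,))
       y_idx = torch.cat([t.unsqueeze(0), negatives])       # true tail first, then negatives
       y_vec = torch.zeros(neg_sample_ratio + 1)
       y_vec[0] = 1.0                                       # label of the true tail
       if label_smoothing_rate:
           y_vec = (1.0 - label_smoothing_rate) * y_vec + label_smoothing_rate / y_vec.numel()
       return x, y_idx, y_vec


   x, y_idx, y_vec = one_vs_sample_item(torch.LongTensor([0, 0, 1]),
                                        num_entities=100, neg_sample_ratio=5)
   print(x.shape, y_idx.shape, y_vec.shape)  # torch.Size([2]) torch.Size([6]) torch.Size([6])
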
.. py:class:: KvsSampleDataset(train_set_idx: numpy.ndarray, entity_idxs, relation_idxs, form, store=None, neg_ratio=None, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   KvsSample a Dataset:

   D := {(x, y)_i}_{i=1}^{N}, where

   - x: (h, r) is a unique pair of an entity h \in E and a relation r \in R,
   - y \in [0, 1]^{|E|} is a binary label, with y_i = 1 for every i s.t. (h, r, E_i) \in KG.

   At each mini-batch construction, we subsample y, hence |new_y| << |E|;
   new_y contains all 1's if sum(y) < neg_sample_ratio.

   Parameters
   ----------
   train_set_idx
       Indexed triples for the training.
   entity_idxs
       mapping.
   relation_idxs
       mapping.
   form
       ?
   store
       ?
   label_smoothing_rate
       ?

   Returns
   -------
   torch.utils.data.Dataset

   .. py:attribute:: train_data
      :value: None

   .. py:attribute:: train_target
      :value: None

   .. py:attribute:: neg_ratio
      :value: None

   .. py:attribute:: num_entities

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: collate_fn
      :value: None

   .. py:attribute:: max_num_of_classes

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: NegSampleDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1)

   Bases: :py:obj:`torch.utils.data.Dataset`

   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass it.
   All subclasses should overwrite :meth:`__getitem__`, supporting fetching a data
   sample for a given key. Subclasses could also optionally overwrite :meth:`__len__`,
   which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options of
   :class:`~torch.utils.data.DataLoader`. Subclasses could also optionally implement
   :meth:`__getitems__`, to speed up loading of batched samples. This method accepts a
   list of sample indices for a batch and returns a list of samples.

   .. note::
      :class:`~torch.utils.data.DataLoader` by default constructs an index sampler
      that yields integral indices. To make it work with a map-style dataset with
      non-integral indices/keys, a custom sampler must be provided.

   .. py:attribute:: neg_sample_ratio

   .. py:attribute:: train_set

   .. py:attribute:: length

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)


.. py:class:: TriplePredictionDataset(train_set: numpy.ndarray, num_entities: int, num_relations: int, neg_sample_ratio: int = 1, label_smoothing_rate: float = 0.0)

   Bases: :py:obj:`torch.utils.data.Dataset`

   Triple Dataset:

   D := {(x)_i}_{i=1}^{N}, where

   - x: (h, r, t) \in KG is a triple with an entity h \in E, a relation r \in R and a tail entity t \in E.
   - collate_fn => generates negative triples: for every (h, r, t) \in G, create negative triples
     {(h, r, x), (x, r, t), (h, m, t)} by corrupting the tail, the head and the relation, respectively.
   - y: labels are represented in torch.float16.

   Parameters
   ----------
   train_set_idx
       Indexed triples for the training.
   entity_idxs
       mapping.
   relation_idxs
       mapping.
   form
       ?
   store
       ?
   label_smoothing_rate
   collate_fn: batch: List[torch.IntTensor]

   Returns
   -------
   torch.utils.data.Dataset

   .. py:attribute:: label_smoothing_rate

   .. py:attribute:: neg_sample_ratio

   .. py:attribute:: train_set

   .. py:attribute:: length

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:method:: __len__()

   .. py:method:: __getitem__(idx)

   .. py:method:: collate_fn(batch: List[torch.Tensor])

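The negative-triple generation described above can be sketched as a standalone collate
function. This is a simplified stand-in for :meth:`TriplePredictionDataset.collate_fn`:
it corrupts only heads and tails, draws corruptions uniformly at random, and does not
filter accidental positives:

.. code-block:: python

   from typing import List

   import torch


   def corrupting_collate_fn(batch: List[torch.Tensor], num_entities: int,
                             neg_sample_ratio: int = 1):
       """Sketch: stack positive triples and append corrupted copies as negatives."""
       pos = torch.stack(batch)                                      # (B, 3) positive triples
       num_neg = (pos.size(0) * neg_sample_ratio,)

       corrupt_head = pos.repeat(neg_sample_ratio, 1)
       corrupt_head[:, 0] = torch.randint(0, num_entities, num_neg)  # replace heads

       corrupt_tail = pos.repeat(neg_sample_ratio, 1)
       corrupt_tail[:, 2] = torch.randint(0, num_entities, num_neg)  # replace tails

       x = torch.cat([pos, corrupt_head, corrupt_tail], dim=0)
       # 1.0 for positives, 0.0 for corrupted triples; the class docstring notes that
       # labels are represented in torch.float16
       y = torch.cat([torch.ones(pos.size(0)),
                      torch.zeros(corrupt_head.size(0) + corrupt_tail.size(0))])
       return x, y


   x, y = corrupting_collate_fn([torch.LongTensor([0, 0, 1]), torch.LongTensor([1, 0, 2])],
                                num_entities=10, neg_sample_ratio=2)
   print(x.shape, y.shape)  # torch.Size([10, 3]) torch.Size([10])
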
.. py:class:: CVDataModule(train_set_idx: numpy.ndarray, num_entities, num_relations, neg_sample_ratio, batch_size, num_workers)

   Bases: :py:obj:`pytorch_lightning.LightningDataModule`

   Create a Dataset for cross validation.

   :param train_set_idx: Indexed triples for the training.
   :param num_entities: entity to index mapping.
   :param num_relations: relation to index mapping.
   :param batch_size: int
   :param form: ?
   :param num_workers: int for https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
   :rtype: ?

   .. py:attribute:: train_set_idx

   .. py:attribute:: num_entities

   .. py:attribute:: num_relations

   .. py:attribute:: neg_sample_ratio

   .. py:attribute:: batch_size

   .. py:attribute:: num_workers

   .. py:method:: train_dataloader() -> torch.utils.data.DataLoader

      An iterable or collection of iterables specifying training samples.

      For more information about multiple dataloaders, see this :ref:`section `.

      The dataloader you return will not be reloaded unless you set
      :paramref:`~pytorch_lightning.trainer.trainer.Trainer.reload_dataloaders_every_n_epochs`
      to a positive integer.

      For data processing use the following pattern:

      - download in :meth:`prepare_data`
      - process and split in :meth:`setup`

      However, the above are only necessary for distributed processing.

      .. warning:: do not assign state in prepare_data

      - :meth:`~pytorch_lightning.trainer.trainer.Trainer.fit`
      - :meth:`prepare_data`
      - :meth:`setup`

      .. note::
         Lightning tries to add the correct sampler for distributed and arbitrary hardware.
         There is no need to set it yourself.

   .. py:method:: setup(*args, **kwargs)

      Called at the beginning of fit (train + validate), validate, test, or predict. This is a
      good hook when you need to build models dynamically or adjust something about them.
      This hook is called on every process when using DDP.

      :param stage: either ``'fit'``, ``'validate'``, ``'test'``, or ``'predict'``

      Example::

          class LitModel(...):
              def __init__(self):
                  self.l1 = None

              def prepare_data(self):
                  download_data()
                  tokenize()

                  # don't do this
                  self.something = else

              def setup(self, stage):
                  data = load_data(...)
                  self.l1 = nn.Linear(28, data.num_classes)

   .. py:method:: transfer_batch_to_device(*args, **kwargs)

      Override this hook if your :class:`~torch.utils.data.DataLoader` returns tensors wrapped
      in a custom data structure.

      The data types listed below (and any arbitrary nesting of them) are supported out of the box:

      - :class:`torch.Tensor` or anything that implements `.to(...)`
      - :class:`list`
      - :class:`dict`
      - :class:`tuple`

      For anything else, you need to define how the data is moved to the target device
      (CPU, GPU, TPU, ...).

      .. note::
         This hook should only transfer the data and not modify it, nor should it move the data
         to any other device than the one passed in as argument (unless you know what you are
         doing). To check the current state of execution of this hook you can use
         ``self.trainer.training/testing/validating/predicting`` so that you can add different
         logic as per your requirement.

      :param batch: A batch of data that needs to be transferred to a new device.
      :param device: The target device as defined in PyTorch.
      :param dataloader_idx: The index of the dataloader to which the batch belongs.

      :returns: A reference to the data on the new device.

      Example::

          def transfer_batch_to_device(self, batch, device, dataloader_idx):
              if isinstance(batch, CustomBatch):
                  # move all tensors in your custom data structure to the device
                  batch.samples = batch.samples.to(device)
                  batch.targets = batch.targets.to(device)
              elif dataloader_idx == 0:
                  # skip device transfer for the first dataloader or anything you wish
                  pass
              else:
                  batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
              return batch

      .. seealso::
         - :meth:`move_data_to_device`
         - :meth:`apply_to_collection`

   .. py:method:: prepare_data(*args, **kwargs)

      Use this to download and prepare data. Downloading and saving data with multiple processes
      (distributed settings) will result in corrupted data. Lightning ensures this method is
      called only within a single process, so you can safely add your downloading logic within.

      .. warning:: DO NOT set state to the model (use ``setup`` instead) since this is NOT called on every device

      Example::

          def prepare_data(self):
              # good
              download_data()
              tokenize()
              etc()

              # bad
              self.split = data_split
              self.some_state = some_other_state()

      In a distributed environment, ``prepare_data`` can be called in two ways
      (using :ref:`prepare_data_per_node`)

      1. Once per node. This is the default and is only called on LOCAL_RANK=0.
      2. Once in total. Only called on GLOBAL_RANK=0.

      Example::

          # DEFAULT
          # called once per node on LOCAL_RANK=0 of that node
          class LitDataModule(LightningDataModule):
              def __init__(self):
                  super().__init__()
                  self.prepare_data_per_node = True


          # call on GLOBAL_RANK=0 (great for shared file systems)
          class LitDataModule(LightningDataModule):
              def __init__(self):
                  super().__init__()
                  self.prepare_data_per_node = False

      This is called before requesting the dataloaders:

      .. code-block:: python

          model.prepare_data()
          initialize_distributed()
          model.setup(stage)
          model.train_dataloader()
          model.val_dataloader()
          model.test_dataloader()
          model.predict_dataloader()
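A minimal usage sketch for :py:class:`CVDataModule`. The toy triples and the concrete
keyword-argument values below are illustrative assumptions that only mirror the constructor
signature shown above; whether the returned dataloader can be consumed directly outside a
Lightning ``Trainer`` is also an assumption of this sketch:

.. code-block:: python

   import numpy as np
   from dicee.dataset_classes import CVDataModule

   # toy indexed triples: (head_id, relation_id, tail_id)
   train_set_idx = np.array([[0, 0, 1], [1, 0, 2], [2, 0, 0]])

   data_module = CVDataModule(train_set_idx=train_set_idx,
                              num_entities=3,
                              num_relations=1,
                              neg_sample_ratio=1,
                              batch_size=2,
                              num_workers=0)

   for batch in data_module.train_dataloader():
       ...  # feed each batch to a knowledge graph embedding model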