yzhao062 / pyod

A Python Library for Outlier and Anomaly Detection, Integrating Classical and Deep Learning Techniques
http://pyod.readthedocs.io
BSD 2-Clause "Simplified" License
8.6k stars 1.37k forks

VAE.fit(X, y): providing y results in calling .to(device) on a list; y is not explicitly ignored for unsupervised models. #591

Open wyler0 opened 4 months ago

wyler0 commented 4 months ago

The documentation states that y_train is ignored when passed to an unsupervised model's fit function, but that does not seem to be the case. Based on the code, this appears to affect all classes inheriting from BaseDeepLearningDetector, though I have only confirmed it on VAE.

I only found this because I upgraded from PyOD 1.x to 2.x and my 1.x code started failing.

model = VAE()
X_train = np.array(...)
y_train = np.array(...)

model.fit(X_train, y_train)

Error:

  File "[...]/lib/python3.10/site-packages/pyod/models/base_dl.py", line 194, in fit
    self.train(train_loader)
  File "[...]/lib/python3.10/site-packages/pyod/models/base_dl.py", line 229, in train
    loss = self.training_forward(batch_data)
  File "[...]/lib/python3.10/site-packages/pyod/models/vae.py", line 246, in training_forward
    x = x.to(self.device)
AttributeError: 'list' object has no attribute 'to'

Offending code:

    def training_forward(self, batch_data):
        x = batch_data
        x = x.to(self.device)
        self.optimizer.zero_grad()
        x_recon, z_mu, z_logvar = self.model(x)
        loss = self.criterion(x, x_recon, z_mu, z_logvar,
                              beta=self.beta, capacity=self.capacity)
        loss.backward()
        self.optimizer.step()
        return loss.item()

The batch_data is assigned in the BaseDeepLearningDetector as follows:

        if self.preprocessing:
            self.X_mean = np.mean(X, axis=0)
            self.X_std = np.std(X, axis=0)
            train_set = TorchDataset(X=X, y=y,
                                     mean=self.X_mean, std=self.X_std)
        else:
            train_set = TorchDataset(X=X, y=y)

      [...]

        train_loader = torch.utils.data.DataLoader(
            dataset=train_set, batch_size=self.batch_size,
            shuffle=True, drop_last=True)
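To illustrate why batch_data arrives as a list rather than a tensor: when the dataset is built with labels, its __getitem__ returns a (sample, label) tuple, and the DataLoader's default collation transposes a batch of tuples into a two-element list [x_batch, y_batch]. A minimal torch-free sketch (TinyDataset and simple_collate are hypothetical stand-ins for pyod's TorchDataset and torch's default_collate):

```python
class TinyDataset:
    """Stand-in for pyod's TorchDataset when y is not None (hypothetical)."""

    def __init__(self, X, y):
        self.X, self.y = X, y

    def __getitem__(self, i):
        # Returning a tuple is what triggers the list-shaped batch.
        return self.X[i], self.y[i]

    def __len__(self):
        return len(self.X)


def simple_collate(items):
    # Simplified version of torch's default collation: a batch of
    # (sample, label) tuples is transposed into [samples, labels].
    return [list(field) for field in zip(*items)]


ds = TinyDataset(X=[[1.0], [2.0]], y=[0, 1])
batch = simple_collate([ds[0], ds[1]])
print(type(batch).__name__)  # -> list: a list has no .to(), hence the AttributeError
```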

And in train we do the following:

    def train(self, train_loader):
        """Train the deep learning model.

        Parameters
        ----------
        train_loader : torch.utils.data.DataLoader
            The data loader for training the model.
        """
        for epoch in tqdm.trange(self.epoch_num,
                                 desc=f'Training: ',
                                 disable=not self.verbose == 1):
            start_time = time.time()
            overall_loss = []
            for batch_data in train_loader:
                loss = self.training_forward(batch_data)

So it seems that x = x.to(self.device) needs to handle the case where batch_data comprises both data and labels, or y needs to be explicitly ignored when the train_loader is built (e.g., by not passing it to TorchDataset).
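One possible shape for the first fix is a small helper that drops the labels before the tensor is moved to the device; extract_features is a hypothetical name, not part of the pyod API:

```python
def extract_features(batch_data):
    """Return only the feature batch from a DataLoader batch.

    If the dataset yielded (x, y) pairs, the default collation
    produces a two-element list [x_batch, y_batch]; keep x_batch.
    Otherwise the batch is already the feature tensor.
    """
    if isinstance(batch_data, (list, tuple)):
        return batch_data[0]
    return batch_data
```

training_forward could then start with `x = extract_features(batch_data).to(self.device)`, which keeps fit(X) working unchanged while making fit(X, y) behave as the documentation promises.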