xcsf-dev / xcsf

XCSF learning classifier system: rule-based online evolutionary machine learning
GNU General Public License v3.0

Freezing, exiting without error message #135

Closed: hosford42 closed this issue 1 month ago

hosford42 commented 1 month ago

I'm new to your lib, and I'm running into an unexpected problem. When I call fit, the code freezes. I left it running with the CPU at 100% for several hours and nothing happened, not even a print to the command line or a log message, so I killed the process. I lowered both the population size and perf_trials to 1 to see if maybe it was just slow reporting due to my config. Now it pauses for about 30 seconds and then exits abruptly without any message whatsoever, not even a traceback.

I'm a little stumped as to where to even start debugging this, so I'd appreciate recommendations. I've gotten it to run on a smaller data set w/o issue, and it doesn't appear to be running out of memory. The data set I'm working with is a classification problem: 41 columns for x and 33 classes one-hot encoded for y, with 671,088 rows. Not small, but not unreasonably large either. Could that be the source of my problem? Otherwise, I'm sure it's blatant user error of some sort.

Offending code snippet:

model = xcsf.XCS(
    x_dim=x.shape[-1],
    y_dim=y.shape[-1],
    n_actions=1,
    random_state=seed,
    max_trials=epochs * len(x_train),
    perf_trials=1,  # len(x_train)
    pop_size=1,  # 5000
    loss_func="log",
    e0=0.01,
    alpha=1,
    nu=5,
    beta=0.05,
    delta=0.1,
    theta_sub=400,
    theta_del=200,
    stateful=False,
    ea={
        "select_type": "roulette",
        "theta_ea": 200,
        "lambda": 2,
        "p_crossover": 0.8,
        "err_reduc": 1,
        "fit_reduc": 0.1,
        "subsumption": False,
        "pred_reset": False,
    },
    action={
        "type": "integer",
    },
    condition={
        "type": "tree_gp",
        "args": {
            "min_constant": 0,
            "max_constant": 1,
            "n_constants": 100,
            "init_depth": 5,
            "max_len": 10000,
        },
    },
)
callback = xcsf.EarlyStoppingCallback(
    monitor="val",
    patience=20000,
    restore_best=True,
    min_delta=0,
    start_from=0,
    verbose=True
)
print(model.json())  # This prints just before it freezes
model.fit(x_train, y_train, validation_data=validation_data, callbacks=[callback], verbose=True)
print("DONE")  # Never prints
dpaetzel commented 1 month ago

Off the top of my head: a population size of 5000 and ~600,000 data points means that in each iteration 5000 * 600_000 = 3_000_000_000 matching operations are performed, and that seems like quite a lot to me (or at least more than what is typically thrown at this system). I'm not very familiar with the tree_gp code, but 3 billion GP tree evaluations (i.e., 5000 trees, each evaluated on 600,000 data points) could actually simply take a long time. Of course, this would not explain the abrupt exiting behaviour you see when you downsize the population.

Could you maybe extend your example with randomly generated input data that somewhat matches the shape of your actual input data?

rpreen commented 1 month ago

Yeah, a complete working example with some generated data, e.g., from sklearn's make_classification would be helpful.
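
Something along these lines would match the shapes you described (a sketch only; n_informative and the class balance are placeholders I picked, not values from your data):

import numpy as np
from sklearn.datasets import make_classification

# Roughly match the reported data: 671,088 rows, 41 features, 33 classes.
x, labels = make_classification(
    n_samples=671_088,
    n_features=41,
    n_informative=10,  # placeholder; must satisfy n_classes <= 2**n_informative
    n_classes=33,
    n_clusters_per_class=1,
    random_state=0,
)
# One-hot encode the integer labels, as in the original setup.
y = np.equal(labels[:, np.newaxis], np.arange(33)[np.newaxis, :]).astype(np.float64)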

One problem that I can see is that the predictions are fitting regression curves with normalised least mean squares (NLMS) for a classification problem with log (cross-entropy) loss. The log loss computation requires that the predictions sum to 1, which won't be true for NLMS/RLS, since there is currently no softmax function added to them -- maybe this needs some extra input checking to warn about it.

There is a onehot loss function that will argmax the predictions and return a binary error - if you try this and it doesn't crash, that will confirm the loss function is the problem.
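
In NumPy terms, the onehot error is essentially this (an illustration of the description above, not the library's actual code):

import numpy as np

y_true = np.array([0.0, 1.0, 0.0])  # one-hot target
pred = np.array([0.2, 0.5, 0.3])    # raw prediction; need not sum to 1
err = float(np.argmax(pred) != np.argmax(y_true))  # 0.0 here: argmax matches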

Adding a softmax to the NLMS/RLS outputs might be a nice new feature, with a new parameter to specify whether the problem is regression or classification -- I'm not aware of any prior work actually using it in this way, but I don't see why it wouldn't work, since it's similar to neural nets.
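
The idea would be something like this (a sketch of the proposal, not existing library code):

import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_loss(y_true, raw_pred):
    # Cross-entropy on softmax-normalised raw regression outputs,
    # so the predictions sum to 1 as the log loss requires.
    p = np.clip(softmax(raw_pred), 1e-15, 1.0)
    return float(-np.sum(y_true * np.log(p), axis=-1).mean())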

You could try classification the traditional XCS(F) way with a label encoding and set it up as a single-step RL problem with n_actions=33, y_dim=1, similar to the RMUX example, where the reward is 1 if correct and 0 otherwise; there you could have the integer actions subdivide the space.
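
The label encoding itself is just this (illustrative; the random y_onehot stands in for your one-hot targets):

import numpy as np

rng = np.random.default_rng(0)
y_onehot = rng.multinomial(1, np.full(33, 1 / 33), size=8)  # dummy one-hot targets
labels = np.argmax(y_onehot, axis=1).reshape(-1, 1).astype(np.float64)
# Single-step RL setup: n_actions=33, y_dim=1; reward 1.0 when the chosen
# action equals labels[i], otherwise 0.0.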


rpreen commented 1 month ago

It's definitely not the dataset size that is the problem; I have tested much bigger. Although you will definitely want a much smaller perf_trials, like 1000, or you will not see output.

Also, I recommend running this stuff in a script rather than a notebook, since if it hard exits it will print a message to stdout, and notebooks seem to hide it.

If it's actually crashing or hanging, it's probably something more than just the loss function, but you will definitely want to resolve that too. I really need a complete example where it crashes to say more.

hosford42 commented 1 month ago

Narrowed it down quite a bit: the problem is with the y data type. It has to be floating-point to avoid the error. Also, the bug is only triggered if validation data is passed.
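
In other words, casting the targets to float before fitting avoids it (names as in my earlier snippet, so not standalone):

import numpy as np

y_train = y_train.astype(np.float64)
validation_data = (x_val, y_val.astype(np.float64))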

I'm running directly on the command line, and there really is no error message, btw. If you still need it, I will try to get you some stand-alone code that reproduces the issue reliably later today. I really appreciate your help!

dpaetzel commented 1 month ago

IIRC we had wrong types causing crashes without error messages/output before (e.g. if X or y had the wrong shape or something). I thought that was fixed, though.

rpreen commented 1 month ago

It definitely checks the shapes now, but probably doesn't check the types... numpy floats or ints should work. Pretty sure the wiki says they need to be numpy arrays.

rpreen commented 1 month ago

It would be nice to know exactly what was causing it to crash.

I have just created a PR, which adds an extra check to the validation_data to make sure it's a tuple with 2 items, since currently there are no checks before it attempts to access the two arrays.
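
Roughly the Python equivalent of (illustrative; the actual check lives in the C++ wrapper):

if not (isinstance(validation_data, tuple) and len(validation_data) == 2):
    raise ValueError("validation_data must be a tuple: (X_val, y_val)")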

The gateway functions that pass data from Python, like fit() and predict(), are all using const py::array_t<double>, which should throw exceptions if presented with incorrect types [edit: I guess it doesn't actually throw exceptions but automatically converts to the correct type] and is the recommended way to pass numpy arrays, so I'd rather not change this to something like an object and have to do a bunch of manual checking in pybind_wrapper.cpp.
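
For reference, the kind of manual checking I mean would look roughly like this on the Python side (hypothetical, not in the library):

import numpy as np

def check_xy(x, y):
    # Hypothetical guard: require numpy arrays with floating-point dtypes.
    for name, arr in (("x", x), ("y", y)):
        if not isinstance(arr, np.ndarray):
            raise TypeError(f"{name} must be a numpy array, got {type(arr)}")
        if not np.issubdtype(arr.dtype, np.floating):
            raise TypeError(f"{name} must be floating-point, got {arr.dtype}")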

If you can let me know in more detail exactly what caused the crash, and the extra check I just mentioned isn't enough, then I can try adding something to #136. Otherwise, if everything is working now, I guess we can just close this issue and I'll merge that PR.

hosford42 commented 1 month ago

Minimal code for reproducing the bug:

import numpy as np
import pandas as pd
import xcsf
from sklearn.model_selection import train_test_split

# Download archive.zip from https://www.kaggle.com/datasets/subhajournal/iotintrusion
df = pd.read_csv('archive.zip')
y = df['label'].to_numpy()
x = df.drop('label', axis=1).to_numpy(np.float_)

y_lookup, y = np.unique(y, return_inverse=True)
# NOTE: np.equal produces a boolean array, so y has dtype bool here --
# this is what triggers the silent exit (casting to float avoids it).
y = np.equal(y[:, np.newaxis], np.arange(len(y_lookup))[np.newaxis, :])

print(f"{x.shape=} {x.dtype=} {type(x)=} {y.shape=} {y.dtype=} {type(y)=}")

model = xcsf.XCS(
    x_dim=x.shape[-1],
    y_dim=y.shape[-1],
    n_actions=1,
    random_state=0,
    loss_func="onehot",
    stateful=False,
)

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=.2, random_state=0)
validation_data = (x_val, y_val)

model.fit(x_train, y_train,
          validation_data=validation_data,
          verbose=True)

print("DONE")  # Never prints
rpreen commented 1 month ago

In that code, you haven't specified the condition type, so it's using the default hyperrectangles, which assume the inputs are in the range [0,1], so you need to scale the features like:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
x = scaler.fit_transform(x)

With that modification, I ran it and it seems to complete fine, with the errors slowly coming down, although I doubt those hyperparameters will be great; they will need playing with. The mset metric that is displayed is the average number of classifiers in the population that match a sample, so if that is stuck at 0 (like it is without scaling), that's a sign you need to change the conditions or features.

With 46 features, that's quite a lot for hyperrectangles, so you may need a much bigger population size than the default 2000 to get any kind of decent error as well.
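
Putting the fixes from this thread together, a starting point might look like this (the pop_size value is just a guess; everything else is from the snippets above):

model = xcsf.XCS(
    x_dim=x.shape[-1],
    y_dim=y.shape[-1],
    n_actions=1,
    random_state=0,
    loss_func="onehot",
    stateful=False,
    pop_size=10000,  # guess: well above the 2000 default, given 46 features
)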

hosford42 commented 1 month ago

I actually just realized my mistake on the scaling. Good call. I'm now getting decent results when I cast y to np.float_ and adjust the scaling. It still exits w/o a message for me when I use the above code without casting to np.float_, though.

I'm under a time crunch right now, or I'd dive in and try to figure out a fix. It's good enough for me that I've got it working with the type cast, so there's no urgency to this issue anymore. Thanks again for your help, guys!