xiaoshengli / BOPF

Code for "Linear Time Complexity Time Series Classification with Bag-of-Pattern-Features"
GNU General Public License v3.0

Segmentation fault (core dumped) #1

Open AASHISHAG opened 4 years ago

AASHISHAG commented 4 years ago

Hello @xiaoshengli

Congratulations on the great work and thank you for making the code open-source.

I was trying to use your code to classify my data. The data is weather data with 14 features. The first column contains the label, 0 or 1. But when I run the code, I get the following error. I have attached my dataset for reference.

agarwal@LTLab.lan@wika:~/sah/BOPF$ ls sah/
sah_TEST.csv  sah_TRAIN.csv
agarwal@LTLab.lan@wika:~/sah/BOPF$ ./BOPF sah
dataset:sah
Segmentation fault (core dumped)

Kindly guide.

[Attachment: data.zip]

xiaoshengli commented 4 years ago

Hi Aashish,

Thank you for your interest in our work! I have checked your data and found that it is not time series data. BOPF is designed for time series data, so I think it may not be suitable for this data. More concretely, regarding the error: the length of each time series instance (the number of features in this case) is too short, which prevents the program from producing correct sliding window lengths for the algorithm.

Best regards, Xiaosheng Li
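
To illustrate the failure mode described in this reply, here is a minimal sketch. It assumes, purely for illustration, that candidate sliding window lengths are computed as fixed fractions of the series length and truncated to integers; the actual BOPF code may choose its window lengths differently. The helper `candidate_window_lengths` and the fractions are hypothetical. With only 14 values per row, several candidates collapse to 0, which leaves nothing to slide over.

```python
def candidate_window_lengths(series_length, fractions):
    """Return one integer window length per fraction of the series length.

    Hypothetical helper: the fractions and the truncation rule are
    assumptions for illustration, not taken from the BOPF source.
    """
    return [int(series_length * f) for f in fractions]


# Assumed fractions, chosen only to make the effect visible.
fractions = [0.025, 0.05, 0.1, 0.25, 0.5]

# 14 values per row, as in the dataset from the issue.
short = candidate_window_lengths(14, fractions)
# A longer, made-up series length for comparison.
longer = candidate_window_lengths(350, fractions)

print(short)   # [0, 0, 1, 3, 7]
print(longer)
```

A window of length 0 or 1 carries no shape information, so a program that does not guard against this case can easily misbehave on very short rows.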

AASHISHAG commented 4 years ago

Hello @xiaoshengli ,

Thank you for your response. I am new to this domain. I am not sure if I converted the data correctly into Train and Test files.

Here is the original data: https://drive.google.com/open?id=1WBDlZr0CQNlNYxg8w0xE-0R-3-Nw5IVM

The data is meteorological data covering ~20 years, for example weather, humidity, etc. In total there are 14 features and the class is 0 or 1. Is this not time series data?

The class might also depend on the previous days, or maybe not. I am not sure about it.

As I am new to this field, could you please guide me on how to approach the problem? I have tried using LSTM, CNN, autoencoders, and other machine learning approaches but got bad results. I mean an ROC of 0.50.

I like the idea that you presented in the paper, extracting features as Bag-of-Words, so I wondered whether it could be applied here.

I would be grateful if you could advise.

xiaoshengli commented 4 years ago

Hi Aashish,

I have checked the data and it is not time series data. For time series data, the order of the values matters, while for conventional data we can switch the features without affecting the data. An example of a time series is a temperature record. Each instance in the dataset can be a place, and each value in an instance is the temperature recorded on one day. In this example, switching the values within an instance changes the data. As for the data you provided, since it is not ordered, I think we do not need to use an LSTM or convolutional filters, which are suited to ordered data. Maybe you can try other classification algorithms such as SVM or a multi-layer perceptron.

Best regards, Xiaosheng Li
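
The distinction between ordered and unordered features can be checked numerically. The following is a minimal sketch using a few values borrowed from elsewhere in this thread; the helper names and the column permutation are made up. Applying the same column permutation to two tabular rows leaves the Euclidean distance between them unchanged, whereas the same permutation changes the day-to-day differences (the "shape") of a temperature series.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def permute(row, order):
    """Reorder the values of a row according to a list of indices."""
    return [row[i] for i in order]


def first_differences(series):
    """Differences between consecutive values, a simple description of shape."""
    return [b - a for a, b in zip(series, series[1:])]


# Two tabular rows (unordered features) and an arbitrary column permutation.
row_a = [12.0, 34.0, 45.0, 34.0]
row_b = [12.3, 20.0, 33.0, 37.0]
order = [2, 0, 3, 1]

before = euclidean(row_a, row_b)
after = euclidean(permute(row_a, order), permute(row_b, order))
print(math.isclose(before, after))  # True: the distance does not depend on column order

# A short temperature series (ordered values).
temps = [3.6, 9.2, 6.2, -1.3]
print(first_differences(temps))
print(first_differences(permute(temps, order)))  # a different shape
```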

AASHISHAG commented 4 years ago

Dear Xiaosheng,

Apologies if the questions seem naive.

My data has features like temperature and humidity, recorded in order over the last 20 years for one place, where each row represents a day.

               Temp      Weather      Humidity  ..     FeatureN   Label
Day 1          12        34           45        ..     34         0
Day 2          12.3      20           33        ..     37         1
Day 3          13.5      21           34        ..     45         0

.
.
Day N

1) The data is ordered: Day 1, Day 2, .. Day N.
2) If I move Day 3 before Day 2, the sequence changes.

Please refer to this link: https://drive.google.com/open?id=1WBDlZr0CQNlNYxg8w0xE-0R-3-Nw5IVM

I am not sure if my understanding is still incorrect.

Please help.

Regards, Aashish

xiaoshengli commented 4 years ago

Hi Aashish,

For time series data, the "feature order" matters. For example, in a time series dataset, each instance may represent a place. The "features" in an instance are all temperatures, ordered chronologically. In your case, you could take the values of each column and put them in a row as a time series instance, but this may not work because there is no label for each row as there is in a time series dataset. So my suggestion is to just use conventional machine learning methods such as SVM, random forest and neural networks for this problem.

Best regards, Xiaosheng Li
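
A minimal sketch of the column-to-row idea mentioned in this reply, using the three example days from the table earlier in the thread. The dictionary layout is an assumption made for illustration. It shows why the labels do not carry over: after the transformation there is one series per feature, but the labels still belong to individual days.

```python
# Day-by-feature table: one row per day, one label per day.
days = [
    {"Temp": 12.0, "Weather": 34.0, "Humidity": 45.0, "Label": 0},
    {"Temp": 12.3, "Weather": 20.0, "Humidity": 33.0, "Label": 1},
    {"Temp": 13.5, "Weather": 21.0, "Humidity": 34.0, "Label": 0},
]

feature_names = ["Temp", "Weather", "Humidity"]

# One chronological series per feature (each column becomes a row).
series_by_feature = {
    name: [day[name] for day in days] for name in feature_names
}

# The labels belong to days, so there is one label per time point,
# not one label per series.
labels_per_day = [day["Label"] for day in days]

for name, series in series_by_feature.items():
    print(name, series)
print("labels per day:", labels_per_day)
```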

AASHISHAG commented 4 years ago

Dear Xiaosheng,

Thank you for your reply.

Last question. Below is an image that shows the price of a stock over several days.

You wrote: "The 'features' in an instance are all temperatures."
1) In this image, the features are OPEN, HIGH, LOW and CLOSE. These are different features, not instances of one feature, right?

You wrote: "They are ordered chronologically."
2) Here the order of OPEN, HIGH, LOW and CLOSE doesn't matter.

Then why is this a time series dataset?

[Image: 1_Mu4l0UJru3TKJC3-_kNdGA]

Regards, Aashish Agarwal

xiaoshengli commented 4 years ago

Hi Aashish,

For a time series dataset, each instance in the dataset is a time series. For example, in the FaceFour dataset in the source code, each row is the outline of the face of a certain person. The values in a row reflect the shape of a face, so changing the order within a row changes the information. In the stock example, if we regard each row as an instance, then it is not time series data, as each row is not a time series. We can take each column as a time series instead, but in this case there is no label for each column. For BOPF to work, each row should be a time series with a label.

Best regards, Xiaosheng Li
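
A minimal sketch of the row layout described in this reply: one labeled univariate time series per row. It assumes the label is the first field, as the first post describes for its files, and that fields are comma-separated, which is an assumption to be checked against the actual dataset files. The function `parse_labeled_series` and the sample rows are hypothetical.

```python
def parse_labeled_series(lines, delimiter=","):
    """Split each text row into (label, series).

    Assumes the label is the first field and the remaining fields are the
    chronologically ordered values of one univariate series. The delimiter
    is an assumption; adjust it to match the actual files.
    """
    dataset = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        label = int(float(fields[0]))
        series = [float(v) for v in fields[1:]]
        dataset.append((label, series))
    return dataset


# Made-up rows, only to show the shape of the data.
rows = [
    "1,0.10,0.35,0.80,0.35,0.10",
    "0,0.50,0.48,0.51,0.49,0.50",
]
for label, series in parse_labeled_series(rows):
    print(label, len(series), series)
```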

AASHISHAG commented 4 years ago

Dear Xiaosheng,

I referred to some examples of time series data. The data I presented before are examples of multivariate time series, where each column can also be a time series. You can refer to this link: https://machinelearningmastery.com/time-series-datasets-for-machine-learning/

But it looks like, for BOPF to work, each row needs to be a time series. Is the BOPF implementation you presented in the paper only for univariate time series problems?

Regards, Aashish Agarwal

xiaoshengli commented 4 years ago

Hi Aashish,

Yes, the current implementation is only for univariate time series.

Best regards, Xiaosheng Li

AASHISHAG commented 4 years ago

Dear Xiaosheng,

Thank you for your response. I understand what you mentioned earlier, but I am somewhat confused now. My understanding so far is that, for any data to be time series data, both the row data and the column data should be in sequence, which is not the case for my data. Is my understanding correct?

Could you please have a look at this data and let me know:

  1. Is the following data time series data?
  2. Can we use LSTM/CNN on this data?

https://docs.google.com/spreadsheets/d/1-HHnZcy86RqVqqnwGHTEdIOqjVNvaCBdBxEDnWciPOU/edit#gid=0

Regards, Aashish

xiaoshengli commented 4 years ago

Hi Aashish,

Yes, this is time series data, but it cannot be fed directly to BOPF. For BOPF to work, each row of the file should be a time series with a label, and each column should represent a time point. As for LSTM/CNN, it depends on the task. You can use them to build models and algorithms for different tasks such as classification, clustering and prediction. We need to adjust the model and program to read the data correctly. For classification, we need a label for each time series.

Best regards, Xiaosheng Li

AASHISHAG commented 4 years ago

Dear Xiaosheng,

I have one query. You mentioned that the data below (Data 1) is time series data, but why is the data I presented earlier (Data 2) not time series data? Aren't both similar?

You mentioned earlier that in Data 2 we can switch the features/columns, and that therefore Data 2 is not time series data. But we can also switch the features/columns in Data 1 without any impact on the data.

Please help, I am confused now.

Data 1

time y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31 x32 x33 x34 x35 x36 x37 x38 x39 x40 x41 x42 x43 x44 x45 x46 x47 x48 x49 x50 x51 x52 x53 x54 x55 x56 x57 x58 x59 x60 x61
5/1/99 0:00 0 0.37666549 -4.5964348 -4.0957558 13.4976875 -0.1188297 -20.669883 0.00073248 -0.0611137 -0.0599656 -0.0381894 0.87795097 -0.0529588 -13.306135 0.10106831 0.04179988 0.19990057 -2.3273288 -0.9441666 3.07519949 0.12315375 -0.1043342 -0.5707098 -9.7844557 0.35595997 15.8428194 -0.4519735 -0.105282 96 -134.27786 0.05872556 -0.0216449 9.36675497 0.00215065 -69.187583 4.23257091 -0.2252669 -0.1968722 -0.0724494 -0.1037319 -0.7207462 -5.4124363 76.6790421 -0.6327275 1351.63286 -0.6570955 -0.434947 -108.77597 0.08485609 10.2101816 11.2951549 29.9846242 10.0917214 0.05327915 -4.9364344 -24.590146 18.5154363 3.47340047 0.03344426 0.95321898 0.00607578 0
5/1/99 0:02 0 0.47572049 -4.5425018 -4.0183588 16.2306585 -0.1287327 -18.758079 0.00073248 -0.0611137 -0.0599656 -0.0381894 0.87327297 -0.0142438 -13.306135 0.10110831 0.04144688 0.30431257 -2.3406268 -0.9399936 3.07519949 0.12315375 -0.1043342 -0.5748608 -9.7844557 0.36015997 16.4916844 -0.4504505 -0.09243 96 -134.48019 0.05875856 -0.0045789 9.35021497 0.00214865 -68.585197 4.31148991 -0.2252669 -0.1968722 -0.0591034 -0.0838949 -0.7207462 -8.3432223 78.1815981 -0.6327275 1370.37895 -0.8756285 -1.125819 -108.84897 0.08514609 12.5343396 11.2907609 29.9846242 10.0958714 0.06280115 -4.9371794 -32.413266 22.7600653 2.68293347 0.03353626 1.09050198 0.00608278 0
5/1/99 0:04 1 0.36384849 -4.6813938 -4.3531468 14.1279975 -0.1386357 -17.836632 0.01080348 -0.0611137 -0.0300566 -0.0183524 1.00490997 0.06515015 -9.6195962 0.10114831 0.04109488 0.25283857 -2.3539248 -0.9358236 3.07519949 0.12315375 -0.1043342 -0.5790128 -9.7844557 0.36435597 15.9728854 -0.4489265 -0.097144 96 -133.94659 0.05879056 -0.0846579 9.03740897 0.00214765 -67.838187 4.80991391 -0.2252669 -0.1868012 -0.0486964 -0.0738229 -0.7207462 -1.0851663 79.6841541 -0.6327275 1368.12309 -0.0377755 -0.519541 -109.08658 0.08543609 18.5828926 11.2863659 29.9846242 10.1002654 0.07232215 -4.9379244 -34.183774 27.0046633 3.53748747 0.03362926 1.84053998 0.00608978 0

Data 2

date Y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14
01-01-03 0 14.2 4.6 15 4 0.2 0 8 7.1 983.6 3.6 86 10.6 -0.8 -0.8
01-02-03 0 21.6 5.5 9 4 0.2 0 7.8 10.2 967.9 9.2 88 11.8 6.8 6
01-03-03 0 18.7 4.4 7.4 4 0 0 7.8 8.5 976.2 6.2 90 8.5 1.2 1

Regards, Aashish

xiaoshengli commented 4 years ago

Hi Aashish,

A time series is a sequence of ordered values. What I meant earlier is that the dataset is not time series data in the form BOPF needs, because each of its rows is not a time series and each row does not contain a label. If each row is a time series and we switch the columns, the data will change because the shape of the time series changes. For a conventional dataset, we can change the order of the features without changing the data.

Best regards, Xiaosheng Li

AASHISHAG commented 4 years ago

Dear Xiaosheng,

Ok. So, what I have understood is that we can neither use BOPF for Data-1 nor for Data-2, because of the reasons you mentioned above.

Can I use LSTM or CNN on Data-1 (machine data) and Data-2 (weather data) for a classification problem?

Please help. I have to submit a project at the university and I am totally confused.

Regards, Aashish

xiaoshengli commented 4 years ago

Hi Aashish,

Because your labels are for each day rather than for each time series, I suggest there is no need to use LSTM or CNN, as each labeled instance (row) is not a time series. I suggest using conventional classifiers such as SVM.

Best regards, Xiaosheng Li

AASHISHAG commented 4 years ago

Dear Xiaosheng,

I applied SVM and ML methods but got bad results.

I plotted the PACF graph for my temperature feature. It shows "temperature" is dependent on the temperature of the past 6 days.

[Image: PACF plot for temperature]

Suppose I now take only the temperature as my feature and transform it so that every instance (row) contains 6 features (the temperature on 6 different days): one for the current day and 5 for the previous 5 days.

Is it a time series problem now? Can we apply BOPF to it? It is the same approach you suggested to me before, transforming the columns into rows, but I am doing it with a window of 6 days.

  temp
day1 3.6
day2 9.2
day3 6.2
day4 -1.3
day5 -3.2
day6 -2.6
day7 -5.4
day8 -5.4
day9 -8
day10 -8.7

Transformed Data:

Temp Day n   Temp Day n-1   Temp Day n-2   Temp Day n-3   Temp Day n-4   Temp Day n-5   y
-2.6         -3.2           -1.3           6.2            9.2            3.6            0
-5.4         -2.6           -3.2           -1.3           6.2            9.2            1
-5.4         -5.4           -2.6           -3.2           -1.3           6.2            1
-8           -5.4           -5.4           -2.6           -3.2           -1.3           0
-8.7         -8             -5.4           -5.4           -2.6           -3.2           1
-6.1         -8.7           -8             -5.4           -5.4           -2.6           1
-3.8         -6.1           -8.7           -8             -5.4           -5.4           1

Now every instance is a day in sequence and every feature is a temperature. Please guide.
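
The transformation described above can be sketched as a sliding window over the temperature column. This is a minimal illustration, not part of BOPF; the function name `make_lagged_rows` is hypothetical. The temperatures are the ten values listed above. The labels for days 6 to 10 follow the `y` column of the transformed table, while the labels for days 1 to 5 are placeholders, since the thread does not give them.

```python
def make_lagged_rows(values, labels, window):
    """Build rows of `window` consecutive values, newest first, plus a label.

    The row for day n is [v[n], v[n-1], ..., v[n-window+1], label[n]].
    """
    rows = []
    for n in range(window - 1, len(values)):
        newest_first = [values[n - k] for k in range(window)]
        rows.append(newest_first + [labels[n]])
    return rows


# Temperatures for day1..day10 as listed in the thread.
temps = [3.6, 9.2, 6.2, -1.3, -3.2, -2.6, -5.4, -5.4, -8.0, -8.7]
# Per-day labels: placeholders for days 1-5, thread values for days 6-10.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

for row in make_lagged_rows(temps, labels, window=6):
    print(row)
```

Ten days with a window of 6 yield five rows; the first is `[-2.6, -3.2, -1.3, 6.2, 9.2, 3.6, 0]`, matching the first row of the transformed data.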

xiaoshengli commented 4 years ago

Hi Aashish,

One problem is that the time series are too short. For BOPF to work well, the length needs to be greater than 20, which is quite normal for time series data. Another problem is that the label of each time series is not accurate, because the original label is for a single day.

Best regards, Xiaosheng Li
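
A minimal sketch of a pre-flight check based on the length requirement stated in this reply. The threshold of 20 is the figure quoted in the reply, not a value read from the BOPF source, and the helper `too_short` and the sample dataset are hypothetical.

```python
MIN_LENGTH = 20  # figure quoted in the reply; not read from the BOPF code


def too_short(dataset, min_length=MIN_LENGTH):
    """Return the indices of series whose length is not greater than min_length."""
    return [i for i, (_, series) in enumerate(dataset) if len(series) <= min_length]


dataset = [
    (0, [-2.6, -3.2, -1.3, 6.2, 9.2, 3.6]),  # 6-day window from the thread
    (1, [0.0] * 30),                         # made-up series of length 30
]
print(too_short(dataset))  # [0]
```

Running a check like this before invoking the program makes it possible to reject unsuitable input with a clear message.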

AASHISHAG commented 4 years ago

Dear Xiaosheng,

I understand your point.

Looking at my ACF plot, the feature from the above graph is correlated with its own past values over more than 20 days. But my PACF plot shows it is related to approximately the 6 previous days.

How many lookback days should I ideally consider? Should I take the number of days shown by the ACF plot or by the PACF plot?

Please guide.

xiaoshengli commented 4 years ago

Hi Aashish,

I think you can try different lengths and use the one that provides the best result.

Best regards, Xiaosheng Li
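
The suggestion to try different lengths can be sketched as a small search loop. This is a minimal illustration under stated assumptions: the data is synthetic (not the weather data from the thread), a 1-nearest-neighbour classifier stands in for whichever classifier is actually used (it is not BOPF), and the candidate window lengths are arbitrary. Earlier rows are used for training and later rows for testing.

```python
import math


def make_windows(values, labels, window):
    """Rows of `window` consecutive values (oldest first) with the last day's label."""
    return [
        (values[n - window + 1 : n + 1], labels[n])
        for n in range(window - 1, len(values))
    ]


def one_nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier with Euclidean distance."""
    correct = 0
    for series, label in test:
        nearest = min(train, key=lambda item: math.dist(item[0], series))
        correct += nearest[1] == label
    return correct / len(test)


def pick_window(values, labels, candidates, train_fraction=0.7):
    """Score each candidate window length on a chronological hold-out split."""
    scores = {}
    for window in candidates:
        rows = make_windows(values, labels, window)
        split = int(len(rows) * train_fraction)
        scores[window] = one_nn_accuracy(rows[:split], rows[split:])
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic toy data: a smooth signal, labeled 1 when the value rose since the previous day.
values = [math.sin(0.3 * t) + 0.1 * math.sin(1.7 * t) for t in range(200)]
labels = [0] + [int(values[t] > values[t - 1]) for t in range(1, 200)]

best, scores = pick_window(values, labels, candidates=[3, 6, 12, 24])
print(best, scores)
```

Because each window length produces a slightly different set of rows, the scores are only roughly comparable; a more careful comparison would evaluate every candidate on the same held-out days.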