Unable to replicate dataset's labeling

AvivAlloni commented 2 years ago

Hello,

First of all, thank you for this great repo, I really loved your work.

I'm trying to run a POC with your pre-trained network on my data. To verify my labeling code, I'm testing it on the dataset you used (DecPre normalization), trying to replicate its labels, but I'm getting wrong labels. I read your paper and the benchmark dataset paper and used exactly the same method:

Calculating the mid-price: p
Calculating the mean mid-price of the next k events (k = 10, 20, 30, 50, 100): mk
Labeling each event as higher (1), stationary (2) or lower (3), depending on whether (mk - p) / p is above alpha, within ±alpha, or below -alpha, where alpha is 0.00002 (0.002%)
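
The three steps above can be sketched roughly as follows. This is only an illustration of the procedure as described in this post, not the code that produced the FI-2010 labels. The function name `label_midprice_moves`, its arguments, the use of raw (unnormalized) best ask and bid prices, the choice to exclude the current event from the "next k" window, and the mapping 1 = higher, 2 = stationary, 3 = lower are all assumptions. Any of them could differ from how the dataset was actually built.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def label_midprice_moves(best_ask, best_bid, k, alpha=0.00002):
    """Label each event 1 (higher), 2 (stationary) or 3 (lower).

    Hypothetical helper: assumes raw best ask/bid prices per event and that
    the "next k events" are events t+1 .. t+k (current event excluded).
    The last k events get no label because they lack a full future window.
    """
    ask = np.asarray(best_ask, dtype=float)
    bid = np.asarray(best_bid, dtype=float)

    # Step 1: mid-price of every event.
    p = (ask + bid) / 2.0

    n = len(p) - k  # number of events that have k future events
    if n <= 0:
        return np.empty(0, dtype=int)

    # Step 2: mean mid-price over the next k events, one row per event.
    mk = sliding_window_view(p[1:], k).mean(axis=1)

    # Step 3: relative change compared against +/- alpha.
    change = (mk - p[:n]) / p[:n]
    labels = np.full(n, 2, dtype=int)  # default: stationary
    labels[change > alpha] = 1         # higher
    labels[change < -alpha] = 3        # lower
    return labels
```

Comparing the output of such a function with the labels shipped in the dataset, one assumption at a time (raw versus normalized prices, window definition, label mapping), may help narrow down where the mismatch comes from.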

What am I doing wrong?

Thanks in advance

zcakhaa commented 2 years ago

Hi,

You can start labelling with 0 (lower), 1 (stationary) and 2 (higher) instead of 1, 2, 3; otherwise you need to subtract 1 before training. Also, alpha does not need to be 0.002%. You need to adjust this value based on your dataset. You can check the label balance with different alphas. To start, it may be better to have a balanced dataset.
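
As a rough sketch of checking the label balance for several alphas: the helper `label_balance` and its inputs are hypothetical and not part of this repo. It assumes you have already computed the relative change (mk - p) / p for each event, and it uses the 0 (lower), 1 (stationary), 2 (higher) convention.

```python
import numpy as np


def label_balance(relative_change, alphas):
    """Return {alpha: fractions of (lower, stationary, higher) labels}.

    Hypothetical helper: `relative_change` holds (mk - p) / p per event.
    """
    change = np.asarray(relative_change, dtype=float)
    balance = {}
    for alpha in alphas:
        labels = np.ones(len(change), dtype=int)  # 1 = stationary
        labels[change < -alpha] = 0               # 0 = lower
        labels[change > alpha] = 2                # 2 = higher
        counts = np.bincount(labels, minlength=3)
        # max(..., 1) avoids dividing by zero on empty input.
        balance[alpha] = counts / max(len(change), 1)
    return balance


if __name__ == "__main__":
    # Toy values only, not taken from any real dataset.
    demo = [-0.001, -0.00001, 0.0, 0.00001, 0.001, 0.002]
    for alpha, fractions in label_balance(demo, [0.00002, 0.0015]).items():
        print(alpha, fractions)
```

An alpha for which the three fractions are roughly equal would give the balanced starting point mentioned above.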

hrsmrtsingh commented 2 years ago

Hi, I am having the same issue. I am using the FI-2010 dataset (DecPre, No Auction). I am trying to generate the labels for each 'event' using the same steps as in the first post:

Calculating the mid-price: p
Calculating the mean mid-price of the next k events (k = 10, 20, 30, 50, 100): mk
Labeling each event as higher (1), stationary (2) or lower (3), depending on whether (mk - p) / p is above alpha, within ±alpha, or below -alpha, where alpha is 0.00002 (0.002%)

(I understand the need to label with 0, 1 and 2 instead of the dataset's 1, 2 and 3, and to adjust alpha for other data.) However, I am using the same dataset, so the alpha value should be the one given in the FI-2010 paper (0.002%), and I am keeping the dataset's labelling convention. My objective is just to check the validity of steps 1-3 as a label generation process. But I am getting wrong labels throughout.

What am I doing wrong here?

zcakhaa commented 2 years ago

Hi,

It is a publicly available dataset, described in this paper (https://arxiv.org/abs/1705.03233/). If you have any questions regarding the reproducibility of the labels, please consult the authors directly.

davidsblom commented 1 year ago

I also have a question regarding the labels used in the Jupyter notebook examples. You seem to shift the Y indices. Is this the correct approach to get a higher value of k? It seems that you are then training a model which predicts whether the future mid-price is lower, higher or stationary relative to another future price, which seems unexpected.

zcakhaa / DeepLOB-Deep-Convolutional-Neural-Networks-for-Limit-Order-Books

Unable to replicate dataset's labeling #20