xinychen / transdim

Machine learning for transportation data imputation and prediction.
https://transdim.github.io
MIT License

The difference in data #5

Open JKZuo opened 3 years ago

JKZuo commented 3 years ago

A question about the data set: each one comes with three named data items: tensor, random_tensor, and random_matrix. What do these three stand for, and is there any difference between them?

xinychen commented 3 years ago

Hello, thanks for this question! In each data folder, we give three data files: tensor.mat, random_tensor.mat, and random_matrix.mat.

Of course, you can remove both random_tensor.mat and random_matrix.mat and use the following code instead:

import numpy as np

# Specify tensor size
M = 214 # Suppose 214 road segments
I = 61 # Suppose 61 days
J = 144 # Suppose 144 time slots per day

# Generate random matrix of size M-by-I
np.random.seed(1000) # Set random seed
random_matrix = np.random.rand(M, I)

# Or generate random tensor of size M-by-I-by-J
np.random.seed(1000) # Set random seed
random_tensor = np.random.rand(M, I, J)

Hope it can help you!

Best, Xinyu
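As a rough illustration of how such random arrays can be turned into missing-data masks, the sketch below reuses the rounding expression that appears later in this thread, `np.round(rand + 0.5 - missing_rate)`. The `dense_tensor` here is synthetic, and the day-wise masking built from `random_matrix` is an assumption about one plausible use of an M-by-I matrix, not something stated in the reply above.

```python
import numpy as np

M, I, J = 214, 61, 144          # road segments, days, time slots per day
missing_rate = 0.2

# Synthetic stand-in for the real data (assumption: strictly positive values)
rng = np.random.default_rng(0)
dense_tensor = rng.uniform(20.0, 60.0, size=(M, I, J))

# Element-wise random missing: each entry is kept with probability 1 - missing_rate
np.random.seed(1000)
random_tensor = np.random.rand(M, I, J)
binary_tensor = np.round(random_tensor + 0.5 - missing_rate)
sparse_tensor_rm = dense_tensor * binary_tensor

# Day-wise missing (assumed usage): one draw per (segment, day) hides a whole day
np.random.seed(1000)
random_matrix = np.random.rand(M, I)
binary_matrix = np.round(random_matrix + 0.5 - missing_rate)
sparse_tensor_nm = dense_tensor * binary_matrix[:, :, None]

print('Observed fraction (element-wise):', binary_tensor.mean())
print('Observed fraction (day-wise):', binary_matrix.mean())
```

Fixing the seed before each draw keeps the masks reproducible, which is presumably why the repository ships the random arrays as files in the first place.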

cq70605 commented 2 years ago

Hello, I have a data set (sensor readings with missing values) and would like to try LRTC-TNN to see how well it imputes the missing values, but the results I get seem wrong.

```python
import time

import numpy as np
import pandas as pd
from tqdm import tqdm

# ten2mat and LRTC are assumed to be defined as in the transdim LRTC-TNN code.

r = 0.2
print('Missing rate = {}'.format(r))
missing_rate = r

file_path = ''
data_19111201984 = pd.read_csv(file_path, encoding='gbk')
data_19111201984 = data_19111201984[data_19111201984.day.isin([9, 10, 11, 12, 13, 14])]
data_list = []

# Collect one day of readings per group ('温度' is the temperature column)
for day, day_df in tqdm(data_19111201984.groupby('day')):
    data_list.append([day_df['温度'].values.tolist()])

dense_tensor = np.array([ten2mat(np.array(data_list), 2)])
print(dense_tensor.shape)  # (1, 1440, 6): (sensor_id, readings per day, 6 days)
dim1, dim2, dim3 = dense_tensor.shape
np.random.seed(1000)
# Randomly mask entries at the given missing rate
sparse_tensor = dense_tensor * np.round(np.random.rand(dim1, dim2, dim3) + 0.5 - missing_rate)
print(sparse_tensor.shape)

start = time.time()
alpha = np.ones(3) / 3
rho = 1e-4
theta = 30
epsilon = 1e-4
maxiter = 100
LRTC(dense_tensor, sparse_tensor, alpha, rho, theta, epsilon, maxiter)
end = time.time()
print('Running time: %d seconds' % (end - start))
print()
```

The output is:

```
Missing rate = 0.2
100%|██████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 3009.55it/s]
(1, 1440, 6)
(1, 1440, 6)
Total iteration: 2
Tolerance: 0.0
Imputation MAPE: 1.0
Imputation RMSE: 5.55775

Running time: 0 seconds
```

xinychen commented 2 years ago

Hello, thank you for this question! If your tensor data is of size 1-by-1440-by-6, this is really a matrix. Please consider a matrix completion model rather than tensor completion models.

Best regards, Xinyu
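To make the suggestion concrete, here is a minimal sketch of treating a 1-by-1440-by-6 tensor as a 1440-by-6 matrix and filling it with an iterative truncated-SVD imputation. This is a generic low-rank baseline chosen for illustration, not a model from transdim; the function name `svd_impute`, the chosen rank, and the synthetic temperature data are all assumptions.

```python
import numpy as np

def svd_impute(sparse_mat, observed, rank=1, maxiter=200, tol=1e-5):
    """Iteratively fill unobserved entries with a rank-`rank` truncated SVD."""
    # Start from the mean of the observed entries
    filled = np.where(observed, sparse_mat, sparse_mat[observed].mean())
    for _ in range(maxiter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        # Keep observed values, update only the missing ones
        new_filled = np.where(observed, sparse_mat, low_rank)
        change = np.linalg.norm(new_filled - filled) / np.linalg.norm(filled)
        filled = new_filled
        if change < tol:
            break
    return filled

# Synthetic stand-in: 1440 readings per day over 6 days (hypothetical temperatures)
t = np.arange(1440)
daily_profile = 20.0 + 5.0 * np.sin(2 * np.pi * t / 1440)
day_scale = 1.0 + 0.05 * np.arange(6)
dense_tensor = np.outer(daily_profile, day_scale)[None, :, :]   # shape (1, 1440, 6)

# A tensor with a singleton first mode carries no more structure than a matrix
dense_mat = dense_tensor[0]                                      # shape (1440, 6)

missing_rate = 0.2
np.random.seed(1000)
observed = np.round(np.random.rand(*dense_mat.shape) + 0.5 - missing_rate).astype(bool)
sparse_mat = dense_mat * observed

mat_hat = svd_impute(sparse_mat, observed, rank=1)
missing = ~observed
mape = np.mean(np.abs(dense_mat[missing] - mat_hat[missing]) / dense_mat[missing])
rmse = np.sqrt(np.mean((dense_mat[missing] - mat_hat[missing]) ** 2))
print('Imputation MAPE: {:.6f}'.format(mape))
print('Imputation RMSE: {:.6f}'.format(rmse))
```

The synthetic matrix is exactly rank 1, so a rank-1 fit is a best case; real sensor data would likely need a tuned rank or a regularized model.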

cq70605 commented 2 years ago

Thank you for your answer. At the moment I only use the data collected by a single sensor, so my tensor is of size 1-by-1440-by-6. Does that mean that if I use data collected by n sensors and get a tensor of size n-by-1440-by-6, I can then consider a tensor completion model? By the way, is there a matrix completion model you would recommend?

xinychen commented 2 years ago

Yes, you can consider a tensor completion model, but in LRTC-TNN, theta should be smaller than min{n, 1440, 6}.
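The constraint above can be checked before calling the solver. The sketch below is only an illustrative helper: the names `unfolding_sizes` and `theta_is_valid` are hypothetical and not part of transdim. It lists the size of each mode unfolding and applies the stated rule theta < min{n, 1440, 6}.

```python
import numpy as np

def unfolding_sizes(shape):
    """Shape of the mode-k unfolding for each mode k of a tensor with this shape."""
    total = int(np.prod(shape))
    return [(d, total // d) for d in shape]

def theta_is_valid(shape, theta):
    """Apply the rule stated above: theta must be smaller than the smallest dimension."""
    return theta < min(shape)

for shape, theta in [((1, 1440, 6), 30), ((20, 1440, 6), 30), ((20, 1440, 6), 5)]:
    print(shape, 'theta =', theta,
          'unfoldings:', unfolding_sizes(shape),
          'valid:', theta_is_valid(shape, theta))
```

With the settings from the earlier post (shape (1, 1440, 6), theta = 30) the check fails, and it still fails with 20 sensors because the day mode has only 6 entries. Under this rule theta would need to be at most 5 for an n-by-1440-by-6 tensor with n of at least 6.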