wwrechard / pydlm

A python library for Bayesian time series modeling
BSD 3-Clause "New" or "Revised" License

Why use lists as input #7

Closed. xgdgsc closed this issue 7 years ago

xgdgsc commented 7 years ago

Would you consider replacing lists with numpy arrays? That would be easier, and maybe faster, for my current data processing pipeline, which uses pandas.

wwrechard commented 7 years ago

Thanks for the suggestion. I think you can already use a numpy array as input for the main class. The only place where I require a 2-d list is the feature input of a dynamic component (and any class that inherits from it). My original reasoning was the following two points.

  1. It is clearer than a 2-d matrix. With a list of lists, it is clear that each inner list is a feature, whereas with a numpy matrix you have to state explicitly whether a row or a column is an individual feature, which can cause errors or confusion.

  2. Using a numpy matrix might actually slow down the computation. At each step (date) of the Kalman filter, the feature vector for that date is extracted and joined with the evaluation vectors (measurement vectors) from the other components to form the full evaluation vector, which is then multiplied by the transition matrix. So storing the features of an individual component in matrix form does not help the computation, because each vector has to be extracted for filtering anyway. In addition, due to the internal structure of a numpy matrix, extracting one row or column from it might even be slower than indexing a 2-d list.
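To make point 2 concrete, here is a minimal sketch of the per-date extraction and joining step. It is not pydlm's actual code: `build_evaluation_vector`, `trend_eval`, and the sample values are hypothetical, and it assumes each inner list holds the feature values for one date.

```python
# Hypothetical sketch, not pydlm's internal implementation.

def build_evaluation_vector(step, trend_eval, features):
    """Join the trend component's evaluation vector with the dynamic
    component's feature vector for the given step (date)."""
    # With a list of lists, the feature vector for a date is one index away.
    return trend_eval + features[step]

# Assumption: each inner list holds the feature values for one date.
features = [[0.5, 1.0], [0.7, 1.2], [0.9, 1.4]]
trend_eval = [1.0]  # assumed evaluation vector of a constant-trend component

# One joined evaluation vector per date, as the filter would see them.
evaluation = [build_evaluation_vector(t, trend_eval, features)
              for t in range(len(features))]
```

The filter only ever touches one inner list per step, which is why holding the whole feature set in a matrix brings no benefit here.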

However, it makes sense if you want to keep everything consistently in numpy format. I've added support for numpy matrix input to the TODO list. It is a simple change and should be included in the next release.
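Until then, a small conversion on the caller's side is enough. The sketch below is one possible way to do it, not the planned implementation: `to_feature_list` is a hypothetical helper name, and it assumes the array already has one row per inner list.

```python
import numpy as np

def to_feature_list(features):
    """Hypothetical helper: return the features as a plain list of lists."""
    if isinstance(features, np.ndarray):
        if features.ndim != 2:
            raise ValueError("expected a 2-d array of features")
        # ndarray.tolist() converts to nested lists of Python scalars.
        return features.tolist()
    # Already list-like: copy each row so the caller's data is not shared.
    return [list(row) for row in features]

array_input = np.array([[0.5, 1.0], [0.7, 1.2]])
list_input = [[0.5, 1.0], [0.7, 1.2]]

converted = to_feature_list(array_input)
```

A pandas `DataFrame` can be handled the same way by passing `df.to_numpy()` to the helper.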

xgdgsc commented 7 years ago

  1. That could follow the row/column feature convention used in libraries like sklearn, and be stated clearly in the docs.
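For illustration, a minimal sketch of that convention, assuming sklearn's `(n_samples, n_features)` layout maps to one row per date and one column per feature. The array contents and variable names are made up for the example.

```python
import numpy as np

# Assumed layout, following sklearn's (n_samples, n_features) convention:
# one row per date, one column per feature.
X = np.array([[0.5, 1.0],
              [0.7, 1.2],
              [0.9, 1.4]])
n_dates, n_features = X.shape

# Row-per-date data converts directly to a list of per-date feature vectors.
rows_as_dates = X.tolist()

# Data stored the other way round (one row per feature) needs a transpose
# before conversion to recover the same per-date layout.
X_by_feature = X.T
restored = X_by_feature.T.tolist()
```

With the orientation fixed in the docs, the shape alone tells the reader which axis is time.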

Thanks.

xgdgsc commented 7 years ago

Nice!