Open hustnn opened 7 years ago
Hi @hustnn, Thanks for your interest in the project! I have Pandas 0.19.2 on my machine as well. I'm seeing these times (I might be using a different dataset to you):
Native: Total end-to-end time: 7.25
Grizzly: Total end-to-end time, including compilation: 2.61
How many threads are you running this with? I pushed a commit a couple of days ago that set the number of threads used by default to a more sane number (1). Also Grizzly has some fixed overhead (compilation time) that gets amortized away if you run these workloads on larger datasets -- this script helps to make the default dataset larger for testing purposes.
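The "make the dataset larger" step can be sketched as follows. This is only an illustration of the idea (the actual helper script in the repo may differ); `duplicate_rows` and the sample frame are hypothetical names for this sketch:

```python
import pandas as pd

def duplicate_rows(df, factor):
    # Repeat the dataset `factor` times so Grizzly's fixed
    # compilation overhead is amortized over more rows.
    return pd.concat([df] * factor, ignore_index=True)

df = pd.DataFrame({"a": [1, 2, 3]})
big = duplicate_rows(df, 300)
print(len(big))  # 3 rows * factor 300 = 900 rows
```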
Hi @deepakn94, thanks for your reply. I will try it again and post the results later.
@deepakn94
I increased the duplicate factor to 300, and now the performance improvement is obvious:
Pandas: 16s, Grizzly: 8s (2 threads).
I also found that Grizzly's memory usage is much larger than pandas's. I dug into it and found that this may be caused by a change of encoding type when calling `raw_column = np.array(self.df[key], dtype=str)`.
Could it be optimized by keeping the original encoding type from `dataframe[key].values` and performing the conversion at runtime? I opened an issue here: https://github.com/weld-project/weld/issues/128
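To illustrate the memory difference being described: a pandas string column is normally an `object` array of pointers to Python strings, while `np.array(..., dtype=str)` materializes a fixed-width unicode array where every element is padded to the longest string at 4 bytes per character. A minimal sketch (the column name and data here are made up, not from the demo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["row-%d" % i for i in range(1000)]})

# Original column: object dtype, an array of pointers to Python strings.
obj_values = df["key"].values
print(obj_values.dtype)   # object

# Converting with dtype=str yields a fixed-width unicode array:
# every element is padded to the longest string ("row-999", 7 chars),
# stored at 4 bytes per character.
str_values = np.array(df["key"], dtype=str)
print(str_values.dtype)   # <U7
print(str_values.nbytes)  # 7 chars * 4 bytes * 1000 rows = 28000 bytes
```

With longer strings the padding overhead grows accordingly, since every row pays for the widest value in the column.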
I tried the Grizzly data_cleaning demo from https://github.com/weld-project/weld/tree/master/examples/python/grizzly on my MacBook.
The pandas version is 0.19.2.
The native and Grizzly end-to-end times are actually very close:
Native: Total end-to-end time: 1.62 Grizzly: Total end-to-end time, including compilation: 2.03
What results do you see on your testbed if you also use the latest pandas?