weld-project / weld

High-performance runtime for data analytics applications
https://www.weld.rs
BSD 3-Clause "New" or "Revised" License

Performance of native data_cleaning and grizzly data_cleaning #121

Open · hustnn opened 7 years ago

hustnn commented 7 years ago

I tried the grizzly data_cleaning demo from https://github.com/weld-project/weld/tree/master/examples/python/grizzly on my MacBook.

The pandas version is 0.19.2.

The two times are actually very close, so the performance improvement is small.

Native: Total end-to-end time: 1.62
Grizzly: Total end-to-end time, including compilation: 2.03

What results do you see on your testbed if you also use the latest pandas?
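(For reference, the end-to-end numbers quoted in this thread look like simple wall-clock measurements around the whole pipeline. A minimal sketch of such a harness; data_cleaning() here is a runnable stand-in, not the demo's actual code:)

```python
# Minimal wall-clock harness, assuming the quoted timings bracket the
# whole pipeline; data_cleaning() is a placeholder, not the demo's code.
import time

def data_cleaning():
    return sum(range(10**6))  # stand-in for the pandas/Grizzly pipeline

start = time.time()
data_cleaning()
print("Total end-to-end time: %.2f" % (time.time() - start))
```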

deepakn94 commented 7 years ago

Hi @hustnn, thanks for your interest in the project! I have Pandas 0.19.2 on my machine as well. I'm seeing these times (I might be using a different dataset than you):

Native: Total end-to-end time: 7.25
Grizzly: Total end-to-end time, including compilation: 2.61

How many threads are you running this with? I pushed a commit a couple of days ago that sets the default number of threads to a saner value (1). Also, Grizzly has some fixed overhead (compilation time) that gets amortized away if you run these workloads on larger datasets -- this script helps make the default dataset larger for testing purposes.
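(The referenced script isn't reproduced in this thread. As a rough illustration only, enlarging the dataset can be as simple as replicating a CSV's data rows; the file names and default factor below are assumptions, not the repo's actual script:)

```python
# Hypothetical sketch of a dataset-enlarging script: replicate a CSV's
# data rows N times so Grizzly's one-time compilation cost is amortized
# over more work. "data.csv", "data_big.csv", and the default factor
# are illustrative.
import sys

factor = int(sys.argv[1]) if len(sys.argv) > 1 else 300

with open("data.csv") as f:
    header = f.readline()   # keep the header row once
    rows = f.readlines()

with open("data_big.csv", "w") as out:
    out.write(header)
    for _ in range(factor):
        out.writelines(rows)
```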

hustnn commented 7 years ago

Hi @deepakn94, thanks for your reply. I will try it again and post the results later.

hustnn commented 7 years ago

@deepakn94

I increased the duplication factor to 300, and now the performance improvement is obvious.

Pandas: 16s, Grizzly: 8s (2 threads).

I found that Grizzly's memory usage is much larger than pandas'. Digging into it, I found it may be caused by the encoding-type change when calling raw_column = np.array(self.df[key], dtype=str).
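(A note for readers: NumPy's dtype=str produces a fixed-width unicode array padded to the longest string at 4 bytes per character, while df[key].values keeps an object array of 8-byte pointers to the existing Python strings. A small synthetic sketch of the blow-up, not the demo's actual data:)

```python
# Synthetic illustration of the memory blow-up: dtype=str produces a
# fixed-width unicode array ('<U1000' here), so every element is padded
# to the longest string at 4 bytes per character; the object-dtype
# column only stores 8-byte pointers to the existing Python strings.
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a"] * 999 + ["x" * 1000]})  # one long outlier

obj_col = df["key"].values                # dtype=object
str_col = np.array(df["key"], dtype=str)  # dtype='<U1000'

print(obj_col.dtype, obj_col.nbytes)  # object 8000
print(str_col.dtype, str_col.nbytes)  # <U1000 4000000
```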

Could this be optimized by keeping the original encoding type in dataframe[key].values and performing the conversion at runtime? I opened an issue for this: https://github.com/weld-project/weld/issues/128