Pandas list sort kills Zeppelin node

wfau / gaia-dmp

Gaia data analysis platform

GNU General Public License v3.0


Pandas list sort kills Zeppelin node #541

Open Zarquan opened 3 years ago

Zarquan commented 3 years ago

From Dennis's ML cuts notebook:

len(set(df.select('hpx6').toPandas().values))

with possibly 10^6 values caused the Zeppelin node to die.

The node stopped responding on the network, including to ssh.

NigelHambly commented 2 years ago

I'll take this. For now it's a documentation issue: users need to be aware that pandas is non-distributed. IIUC, as soon as a Spark DF is transmogrified into a Pandas DF, a "collect" happens that slurps the data onto the head node running the Zeppelin process, creating a potentially huge memory footprint there. So the immediate action is to document this, at least as a troubleshooting issue.
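For the cautionary note, a minimal sketch of how the same question could be answered without pulling the column onto the head node. It assumes `df` is the existing Spark DataFrame from the notebook with an `hpx6` column; the helper names `count_distinct_on_cluster` and `to_pandas_capped`, and the `max_rows` default, are hypothetical and not part of the platform.

```python
# Sketch only: helper names and the max_rows default are illustrative assumptions.

def count_distinct_on_cluster(df, column):
    """Count the distinct values of `column` with Spark itself.

    The de-duplication and the count run on the executors, so only a
    single integer is returned to the driver (the Zeppelin process).
    """
    return df.select(column).distinct().count()


def to_pandas_capped(df, max_rows=100_000):
    """Convert to pandas only after capping the number of rows.

    toPandas() collects every remaining row into driver memory, so the
    limit() is applied first to bound what gets collected.
    """
    return df.limit(max_rows).toPandas()


# Usage in a notebook paragraph, assuming `df` already exists:
#   n_pixels = count_distinct_on_cluster(df, "hpx6")
#   sample = to_pandas_capped(df.select("hpx6"), max_rows=10_000)
```

The first helper replaces the `len(set(...toPandas()...))` pattern outright. The second is only a guard for cases where a pandas DataFrame is really wanted, and the cap would need tuning to the memory available on the Zeppelin node.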

NigelHambly commented 2 years ago

Immediate action (documenting a clear cautionary note) has been discharged, and the ML cuts notebook is not among the example notebooks presented to users. Removing the DR3 milestone label, but leaving the issue open for further consideration.