Open Zarquan opened 3 years ago
I'll take this. For now it's a documentation issue: users need to be aware that pandas is non-distributed. IIUC, as soon as a Spark DataFrame is transmogrified into a pandas DataFrame, a "collect" happens that slurps all the data onto the Zeppelin process on the head node, creating a potentially huge memory footprint there. So the immediate action is to document this, at least as a troubleshooting issue.
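For the troubleshooting note, a guard along these lines might be worth showing to users. This is only a minimal sketch: `to_pandas_capped` and the `MAX_ROWS` threshold are hypothetical names and values made up for illustration, not something from the existing notebooks. It relies only on the standard PySpark `DataFrame.limit()` and `DataFrame.toPandas()` calls.

```python
# Hypothetical helper: the name and the default threshold are illustrative
# assumptions, not taken from the notebooks or the platform configuration.
MAX_ROWS = 100_000  # assumed upper bound on rows the head node can hold safely


def to_pandas_capped(spark_df, max_rows=MAX_ROWS):
    """Convert a Spark DataFrame to pandas, collecting at most max_rows + 1 rows.

    Returns (pandas_df, truncated), where truncated is True if the Spark
    DataFrame held more than max_rows rows.
    """
    # limit() is a lazy transformation on the Spark side; nothing is moved yet.
    # Asking for one extra row lets us detect truncation without a full count().
    limited = spark_df.limit(max_rows + 1)

    # toPandas() is the step that collects the rows into the driver process.
    pdf = limited.toPandas()

    truncated = len(pdf) > max_rows
    return pdf.head(max_rows), truncated
```

In a notebook this would be called as `pdf, truncated = to_pandas_capped(df)`, with a warning printed when `truncated` is true. Any sensible cap depends on the memory available to the Zeppelin process, which I haven't measured.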
Immediate action (documenting a clear cautionary note) has been discharged, and the ML cuts notebook is not among the example notebooks presented to users. Removing the DR3 milestone label, but leaving this open for further consideration.
From Dennis's ML cuts notebook:
with possibly 10^6 values caused the Zeppelin node to die.
No response to network, including `ssh`.