Open paulasanematsu opened 2 weeks ago
Wow, you are really putting our code to the test. Fun!
Here are my first reactions to what you have already tested:
Questions:
`iebaltab` does not implement any GPU support beyond what Stata's built-in commands support. Therefore, it is hard for me to say how much of a difference this would make.
Suggestions:
`set max_memory .`, which allows Stata to manage memory dynamically as needed. While this is unlikely to be the cause of the issue since it does not seem memory-related, it is good to be aware of this setting.
Let me know what these comments make you think or what these suggestions teach you. Happy to keep working with you until this is resolved. However, it might also be related to Stata (especially on Linux), where I would not be able to help with a solution.
Glad to hear this is a "fun" problem!
Answering your questions:
`/tmp` space. They are the nodes under the `fasse_gpu` partition in our documentation.
Based on Kristoffer's thoughts and suggestions, I have a few suggestions for Raul so we can better understand what is happening:
For all runs below, set `max_memory` to a little lower than the requested memory (e.g., if you request 200 GB, set `max_memory` to 190 GB). Although we don't have indications of a memory issue, I think this is safer than no setting at all.
Rerun a do-file that uses a 5% random sample. I would like to test the 1h timeout hypothesis. You ran this before, but I would like to confirm that it ran beyond the 1h limit. If Stata allows, can you print out the date and time before and after the `iebaltab` function so we know how long that particular function ran? If the 5% sample runs in less than 1h, then increase the sample subset.
Run the original do-file using the `fasse_gpu` partition (i.e., a GPU-enabled computer). To use these, when you request a Stata session, you have to request the `fasse_gpu` partition and request 1 in the "Number of GPUs". I am not sure if Stata needs extra settings to run on a GPU or if it works out of the box. You can check if the GPU card is being used by opening a terminal (Applications on the top left corner -> Terminal Emulator). Then execute the command `nvtop`. If the GPU is being used, you will see a graph with GPU % and GPU mem % being used.
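To make the memory cap and the timing check concrete, here is a minimal sketch of what the top of the test do-file could look like. The `190g` cap, the seed value, and timer slot 1 are illustrative assumptions, and the data-loading line and the `iebaltab` call are placeholders for whatever the real do-file already contains.

```stata
* Sketch only: values below are illustrative, not taken from Raul's do-file.
set max_memory 190g                 // a little below a hypothetical 200 GB request

* <load the dataset here, as in the existing do-file>

set seed 12345                      // arbitrary seed so the subsample is reproducible
sample 5                            // keep a 5% random sample of the observations

* Print wall-clock time before the call and start a timer
display "iebaltab start: " c(current_date) " " c(current_time)
timer clear 1
timer on 1

* <existing iebaltab call goes here, unchanged>

* Stop the timer and print wall-clock time after the call
timer off 1
display "iebaltab end:   " c(current_date) " " c(current_time)
timer list 1                        // elapsed time for slot 1, in seconds
```

If the 5% run finishes in under 1h, raising the number passed to `sample` (for example `sample 10`) is the only change needed to try a larger subset.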
Raul, if you prefer to prepare a do-file for #2, I will be happy to run and observe CPU and memory usage while it runs.
Does Stata have a built-in profiler that shows how much time and memory each function in a piece of code uses? If yes, it would be worth using a profiler in these additional tests.
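On the profiler question, one possible starting point is Stata's `profiler` command, which, as far as I know, reports time spent in ado-programs but not memory, together with the `memory` command for a snapshot of memory usage. The sketch below assumes both behave the same way on the cluster's Stata version, and the `iebaltab` call is again a placeholder.

```stata
* Sketch only: assumes profiler and memory are available in the installed Stata.
profiler clear                      // reset any earlier profiling data
profiler on                         // start collecting timings for ado-programs

* <existing iebaltab call goes here, unchanged>

profiler off                        // stop collecting
profiler report                     // per-program call counts and time
memory                              // snapshot of Stata's current memory usage
```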
Hello,
I am a Research Computing Facilitator at FASRC. Raul Duarte reached out to our support because he was running Stata code with the function `iebaltab` on our cluster and the job was dying midway through the computation. We troubleshot extensively without much progress, so we are reaching out to you for guidance. I will try to summarize the computational environment and what we have done so far.
Unfortunately, Raul's data cannot be shared because of a signed Data Use Agreement (DUA), but we will try to explain as much as possible.
Computational environment
`fasse_bigmem` partition: Intel Ice Lake chipset, 499 GB of RAM, `/tmp` space is 172 GB
`fasse_ultramem` partition: Intel Ice Lake chipset, 2000 GB of RAM, `/tmp` space is 396 GB
Analysis
Raul wrote a Do file that uses the iebaltab function to analyze a dataset that is 4.3GB:
Raul wrote:
His typical run was on `fasse_bigmem` (499 GB of RAM and 64 cores).
Troubleshooting steps
`max_memory` to slightly less than the total memory: he set it to 495 GB when the memory requested on `fasse_bigmem` was 499 GB.
`top` to see CPU and memory usage, and I also kept checking the disk usage of `/tmp` with the `du` command. The core usage was almost at 100% for all 64 cores, memory was at about 5-6% (of 499 GB), and `/tmp` had about 4-5 GB of usage. At about 1h, I could see each process dying and everything stalled.
I am hoping that you have some guidance on whether Raul possibly ran into a bug or whether there is something on our end that we need to change.
Thank you for taking the time to read this. We will be happy to answer any questions.
Best, Paula and Raul