xuyiqing / interflex

Multiplicative Interaction Models Diagnostics and Visualization, Producing Flexible Marginal Effect Estimates

Kernel estimate generates different results despite set.seed() #10

Open · DianaDaInLee opened this issue 2 years ago

DianaDaInLee commented 2 years ago

Hello,

I'm using interflex() with estimator = "kernel". Even when I set the seed with set.seed() before running the model, the results differ each time I run it. The difference in magnitude is large enough that sometimes the CIs cover 0 for a large portion of the X variable and sometimes they don't. Is there a way to make the result reproducible?
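For context, a generic (non-interflex) illustration of why set.seed() alone often fails to pin down parallel results in R: worker processes draw from their own RNG streams unless those streams are seeded explicitly, e.g. with parallel::clusterSetRNGStream():

```r
library(parallel)

# With a PSOCK cluster, set.seed() in the parent does NOT seed the workers.
# clusterSetRNGStream() gives every worker a reproducible L'Ecuyer-CMRG
# substream derived from iseed, so repeated runs match.
draw <- function(seed) {
  cl <- makeCluster(2)
  clusterSetRNGStream(cl, iseed = seed)
  out <- unlist(parLapply(cl, 1:4, function(i) rnorm(1)))
  stopCluster(cl)
  out
}

run1 <- draw(6748259)
run2 <- draw(6748259)
identical(run1, run2)  # TRUE: worker streams are now reproducible
```

Whether interflex's parallel path seeds its workers this way is a question for the maintainers; the sketch only shows the general mechanism.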

sessionInfo():

R version 4.0.4 (2021-02-15) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS 12.4

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] grid stats graphics grDevices utils datasets methods base

other attached packages: [1] dplyr_1.0.9 gridExtra_2.3 interflex_1.2.6 gghighlight_0.3.2 hrbrthemes_0.8.6 ggplot2_3.3.6 here_1.0.1 fixest_0.10.4

loaded via a namespace (and not attached): [1] splines_4.0.4 foreach_1.5.2 carData_3.0-4 AER_1.2-9 Formula_1.2-4 assertthat_0.2.1
[7] yulab.utils_0.0.4 cellranger_1.1.0 globals_0.14.0 gdtools_0.2.4 Rttf2pt1_1.3.10 numDeriv_2016.8-1.1 [13] pillar_1.7.0 lattice_0.20-41 glue_1.6.2 pROC_1.18.0 extrafontdb_1.0 digest_0.6.29
[19] RColorBrewer_1.1-3 colorspace_2.0-3 sandwich_3.0-0 htmltools_0.5.2 Matrix_1.3-2 plyr_1.8.7
[25] pkgconfig_2.0.3 listenv_0.8.0 haven_2.4.3 purrr_0.3.4 xtable_1.8-4 mvtnorm_1.1-3
[31] scales_1.2.0 ggplotify_0.1.0 openxlsx_4.2.3 rio_0.5.16 tibble_3.1.7 mgcv_1.8-33
[37] car_3.0-10 generics_0.1.3 pcse_1.9.1.1 ellipsis_0.3.2 withr_2.5.0 Lmoments_1.3-1
[43] cli_3.3.0 survival_3.2-7 readxl_1.3.1 magrittr_2.0.3 crayon_1.5.1 evaluate_0.15
[49] future_1.23.0 fansi_1.0.3 parallelly_1.30.0 doParallel_1.0.16 nlme_3.1-152 MASS_7.3-53
[55] forcats_0.5.1 foreign_0.8-81 dreamerr_1.2.3 tools_4.0.4 data.table_1.14.2 hms_1.1.1
[61] lifecycle_1.0.1 munsell_0.5.0 zip_2.1.1 compiler_4.0.4 lfe_2.8-6 systemfonts_1.0.4
[67] gridGraphics_0.5-1 rlang_1.0.3 iterators_1.0.14 rstudioapi_0.13 rmarkdown_2.14 gtable_0.3.0
[73] ModelMetrics_1.2.2.2 codetools_0.2-18 abind_1.4-5 curl_4.3.2 DBI_1.1.1 R6_2.5.1
[79] zoo_1.8-8 knitr_1.39 fastmap_1.1.0 extrafont_0.18 utf8_1.2.2 rprojroot_2.0.2
[85] stringi_1.7.6 parallel_4.0.4 Rcpp_1.0.8.3 vctrs_0.4.1 tidyselect_1.1.2 xfun_0.31
[91] lmtest_0.9-38

xuyiqing commented 2 years ago

Hi, have you tried setting nboots = 1000? The result should be the same without parallel computing.

Hi Ziyi, do you have any clue why this may happen?

On Thu, Jul 7, 2022 at 1:26 PM dlee0324 @.***> wrote:


Different results based on seeds: test.pdf https://github.com/xuyiqing/interflex/files/9067107/test.pdf

Sample code:

    set.seed(6748259)
    ncore <- parallel::detectCores() - 1
    interflex(estimator = "kernel", data = data.frame(df.merge),
              Y = 'y', D = 'd', X = 'x',
              Z = c("unemp_delta.s", "owner_delta.s"),
              na.rm = TRUE, neval = 500, cores = ncore, parallel = TRUE,
              kfold = 10, nboots = 200, nsimu = 1000)


DianaDaInLee commented 2 years ago


Thanks for the quick reply -- I just re-ran it without parallelization (parallel = FALSE and removing the cores option) and set nboots = 1000. The results look much more similar now, but they are still not identical:

Screen Shot 2022-07-07 at 11 02 12 PM: https://user-images.githubusercontent.com/63002528/177908706-ff297079-7f39-4d9f-8d4a-c5549a3ffc6d.png

xuyiqing commented 2 years ago

We'll take a look at it on our end. Thanks for reaching out!


--
Yiqing Xu
Assistant Professor, Department of Political Science, Stanford University
https://yiqingxu.org/

DianaDaInLee commented 2 years ago

Hello,

Thank you for investigating this!

I wanted to bring up a related issue I faced: it appears that setting a different seed changes the result quite a bit. I ran identical code (with identical data) with two different seeds (10027 and 67482957) and got the following results:

Screen Shot 2022-07-10 at 5 19 23 PM: https://user-images.githubusercontent.com/63002528/178162433-f78aed2f-4ec8-4197-b5e3-4d369ac41ff7.png
Screen Shot 2022-07-10 at 5 19 38 PM: https://user-images.githubusercontent.com/63002528/178162438-8cadd628-5abd-4ea2-b74d-4d0f3285e148.png

These were run with nboots = 1000 and no parallelization. Some variation due to the seed is expected, but this seems to change the functional form, which is odd.

xuyiqing commented 2 years ago

I think this is because you get a different bandwidth out of cross-validation. Ziyi, any thoughts on stabilizing bandwidth selection? Running CV for more rounds is probably one solution.
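For intuition, a generic sketch (not interflex's internal CV code): the k-fold split itself is random, so a different seed yields a different partition of the data, which can tip the CV criterion toward a different bandwidth:

```r
# The random fold assignment in k-fold CV depends on the seed, so two seeds
# hand the bandwidth search different partitions of the same data.
n <- 200; k <- 10

set.seed(10027)
folds_a <- sample(rep(1:k, length.out = n))

set.seed(67482957)
folds_b <- sample(rep(1:k, length.out = n))

identical(folds_a, folds_b)  # different partitions -> possibly different bandwidths
```

Averaging the CV criterion over several random partitions (repeated CV) is one standard way to reduce this sensitivity.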



lzy318 commented 2 years ago

I think another solution is to convert your moderator to quantile values; this may help the cross-validation when your moderator has a long-tailed distribution.
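A minimal sketch of that transformation (toy data; in practice one would transform the moderator column before passing it as X):

```r
# Empirical-quantile (rank) transform of a long-tailed moderator.
# ecdf() maps each value to its empirical quantile in (0, 1], which spreads
# out the tail before bandwidth cross-validation sees the variable.
set.seed(1)
x   <- rexp(1000)      # a long-tailed moderator
x_q <- ecdf(x)(x)      # quantile-transformed values in (0, 1]
summary(x_q)           # roughly uniform
```

One would then store x_q as a column of the data frame and point the X argument of interflex() at it instead of the raw moderator.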