wshi83 / EhrAgent

[preprint'24] EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records
https://wshi83.github.io/EHR-Agent-page/

About Long-term Memory #2

Closed: Heaven-zhw closed this issue 1 week ago

Heaven-zhw commented 3 months ago

Hello! I'm very interested in this project. I notice that you store correctly predicted examples in memory, but it seems that the model has already seen earlier test samples with their gold labels when predicting the remaining test data. Doesn't this cause a data leakage problem?

night-chen commented 3 months ago

In our main experiments, we follow the reinforcement learning setting [1][2] and leverage the long-term memory mechanism to improve performance, which also demonstrates the importance of selecting the most relevant demonstrations for planning.

To simulate the scenario where ground-truth annotations (i.e., rewards) are unavailable, we further evaluate the effectiveness of long-term memory when it stores all completed cases, regardless of whether they are successful (Appendix Table 7). The results indicate that including completed cases in long-term memory increases the completion rate but tends to reduce the success rate across most difficulty levels, since some incorrect cases may be included as few-shot demonstrations. Nonetheless, this setting still outperforms the variant without long-term memory, confirming the effectiveness of the memory mechanism.

[1] Shinn, Noah, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. "Reflexion: Language Agents with Verbal Reinforcement Learning." Advances in Neural Information Processing Systems 36 (2024).
[2] Sun, Haotian, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. "AdaPlanner: Adaptive Planning from Feedback with Language Models." Advances in Neural Information Processing Systems 36 (2024).
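
For illustration, a minimal sketch of such a long-term memory mechanism is shown below: completed (question, code) pairs are stored, and the top-k most similar past cases are retrieved as few-shot demonstrations for a new question. The class names and the embedding-based retrieval are assumptions for the example, not necessarily the repository's actual implementation.

```python
# Illustrative sketch of a long-term memory for demonstration selection.
# NOTE: names and the embedding-based retriever are assumptions, not the
# repository's actual implementation.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MemoryEntry:
    question: str
    code: str
    embedding: np.ndarray
    succeeded: bool  # whether the generated code produced the correct answer


@dataclass
class LongTermMemory:
    entries: list = field(default_factory=list)

    def add(self, question, code, embedding, succeeded):
        # In the main setting only successful cases are stored;
        # the reward-free ablation stores every completed case.
        self.entries.append(MemoryEntry(question, code, embedding, succeeded))

    def retrieve(self, query_embedding, k=4, successful_only=True):
        # Return the k most similar past cases as few-shot demonstrations.
        candidates = [e for e in self.entries
                      if e.succeeded or not successful_only]
        if not candidates:
            return []
        sims = [
            float(np.dot(query_embedding, e.embedding)
                  / (np.linalg.norm(query_embedding)
                     * np.linalg.norm(e.embedding) + 1e-8))
            for e in candidates
        ]
        top = np.argsort(sims)[::-1][:k]
        return [candidates[i] for i in top]
```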

Heaven-zhw commented 3 months ago

Thanks for your prompt reply. Since the candidate demonstrations are added to long-term memory sequentially, the results might change if the test data were shuffled. Is the prediction order taken into account in your experiments?

wshi83 commented 3 months ago

Thank you for your reply and follow-up questions. We have performed multi-round experiments with shuffled test order and observed almost no effect on the final performance across all three datasets.
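
For reference, such a multi-round shuffled evaluation can be sketched as follows; `run_agent_over` is a hypothetical placeholder for running the agent (with its memory built up sequentially) over a given test order and reporting the success rate, not a function in this repository.

```python
# Illustrative sketch of a multi-round evaluation with shuffled test order.
import random


def evaluate_with_shuffles(test_cases, run_agent_over, seeds=(0, 1, 2)):
    success_rates = []
    for seed in seeds:
        order = list(test_cases)
        random.Random(seed).shuffle(order)  # different prediction order per round
        success_rates.append(run_agent_over(order))
    mean = sum(success_rates) / len(success_rates)
    return mean, success_rates
```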