zhaiyi000 / tlm

Performance on unseen networks and hardwares #1

Closed: lv2020 closed this issue 3 months ago

lv2020 commented 3 months ago

Thanks for sharing your code and dataset! I have two main questions regarding the generalization performance of TLM:

  1. I noticed that the same networks appear in both the training and test datasets, but with different input sizes. I'm curious about TLM's performance on completely unseen networks. For example, how does it perform if we train on CV models and then test on NLP models?
  2. Can the pretrained model be easily transferred to a new platform? For instance, is it feasible to train on V100 and then test on A100?
zhaiyi000 commented 3 months ago

Hi lv2020, these are two key questions.

  1. Workloads (i.e., deep learning models) are divided into subgraphs. A subgraph consists of a type (e.g., matmul + elementwise_add, conv + relu) and a shape (i.e., an input size). The number of subgraph types is small (there should be only dozens), while the space of shapes is unbounded, although statistically it is concentrated as well: roughly 80% of subgraphs use 20% of the shapes, or even 95% and 5%. The goal of TLM is not to train on some subgraph types and run inference on other types, but to train on all subgraph types with part of the shapes and run inference on the remaining shapes.
  2. Yes, and this is an advantage of TLM. One purpose of adding hardware specifications to tensor sentences is precisely to work across hardware. For cross-platform use we do not recommend "transfer"; instead, simply train on the combined data. The hardware specification turns the same subgraph into two different prompts, and the more data there is, the stronger the language model becomes. The data collected on the V100 greatly helps convergence on the A100 (see the sketch after this list).
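Below is a minimal sketch of the tensor-sentence idea described above: a subgraph is characterized by a type token, its shape, and a hardware specification, so the same subgraph on two GPUs produces two different prompts that can simply be trained together. The token layout, names, and string format here are illustrative assumptions, not TLM's actual serialization.

```python
# Hypothetical illustration of a "tensor sentence". The token layout and names
# below are assumptions for illustration, not TLM's actual serialization: a
# subgraph is described by a type token, shape tokens, and hardware-spec tokens.
from dataclasses import dataclass
from typing import List


@dataclass
class Subgraph:
    kind: str           # subgraph type, e.g. "conv2d+relu" (small, closed vocabulary)
    shape: List[int]    # input shape, e.g. [1, 64, 56, 56] (open-ended, long-tailed)


def to_tensor_sentence(sg: Subgraph, hardware: str) -> List[str]:
    """Flatten a subgraph plus a hardware spec into a token sequence."""
    tokens = [f"[HW={hardware}]", f"[TYPE={sg.kind}]"]
    tokens += [f"[DIM={d}]" for d in sg.shape]
    return tokens


# The same subgraph yields two different prompts on two GPUs, so data collected
# on both can be mixed into one training set instead of being "transferred".
sg = Subgraph(kind="conv2d+relu", shape=[1, 64, 56, 56])
print(to_tensor_sentence(sg, hardware="V100"))
print(to_tensor_sentence(sg, hardware="A100"))
```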
lv2020 commented 3 months ago

Thanks for your quick reply!

For the first question, my concern is whether there is label leakage in this case, since only the shape is different. And if a new type of subgraph appears, does TLM need to be retrained?

zhaiyi000 commented 3 months ago

We recommend retraining, which is the simplest and most effective approach, and TLM training is very fast: we found that pre-training on 4 V100s for 2 hours is almost enough, and if you load the previous checkpoint before pre-training it is even quicker; I would guess about 30 minutes (a sketch of this is shown below). Of course, retraining is not strictly necessary. TLM's generalization ability comes from the subgraph-type token together with the other tokens (such as the subgraph shape and hardware specifications), so even if the subgraph-type token is [UNKNOWN], it still generalizes to some extent.
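As a rough illustration of loading a previous checkpoint and continuing pre-training, here is a minimal sketch assuming a Hugging Face-style causal LM underneath; the checkpoint path, training arguments, and toy dataset are placeholders rather than files or APIs taken from this repo.

```python
# Minimal sketch of resuming pre-training from an earlier checkpoint, assuming a
# Hugging Face-style causal LM. The checkpoint path and the toy dataset below
# are placeholders, not artifacts shipped with this repo.
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments


class TensorSentenceDataset(Dataset):
    """Toy stand-in for a tokenized corpus of V100 + A100 tensor sentences."""

    def __init__(self, tokenizer, sentences):
        self.examples = [
            tokenizer(s, truncation=True, max_length=128, return_tensors="pt")
            for s in sentences
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids = self.examples[idx]["input_ids"].squeeze(0)
        # Standard causal-LM setup: the labels are the input ids themselves.
        return {"input_ids": ids, "labels": ids.clone()}


checkpoint = "path/to/previous_checkpoint"   # earlier pre-trained weights (placeholder)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

train_dataset = TensorSentenceDataset(
    tokenizer,
    sentences=["[HW=A100] [TYPE=conv2d+relu] [DIM=1] [DIM=64] [DIM=56] [DIM=56]"],
)

args = TrainingArguments(output_dir="tlm_continue_pretrain", num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()   # continue pre-training from the loaded checkpoint
```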

lv2020 commented 3 months ago

Thank you for your patience and detailed answer!