zhaiyi000 / tlm


Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning

This repo is based on TVM v0.12.0 and reuses some code from TenSet and TLP.

TLM has been integrated into Ansor, TVM (MetaSchedule), MindSpore's AKG, and AKG-MLIR.

tlm slides

Installation

Getting Started Instructions

To get started quickly, you need to download tlm_dataset. Here we take compiling bert_base as an example.


TLM-Ansor

cd gen
  1. Train TLM-base, taking the NVIDIA V100 as an example.

    • Partition the workload into subgraphs and save the results to dataset/network_info/v100.

      python dump_network_info.py --target=nvidia/nvidia-v100

      Values of --target for other hardware can be found in src/target/tag.cc. For CPUs, it can be set to something like --target="llvm -mcpu=core-avx2 -model=i7", as in the example below.
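
      For example, an illustrative CPU invocation combining the command above with that target string (the -mcpu and -model values are assumptions and should match your machine):

      python dump_network_info.py --target="llvm -mcpu=core-avx2 -model=i7"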

    • Dump tensor programs for these subgraphs. The resulting tensor programs have no measured execution latency yet, i.e. they are unlabeled data. They will be saved to dataset/to_measure_programs/v100.

      python dump_programs.py --target=nvidia/nvidia-v100
    • Use the unlabeled data to build a vocabulary and a tokenizer, and save them to --tokenizer_path.

      python make_dataset.py \
      --for_type=for_gen_tokenizer \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100
    • Use the tokenizer to build the TLM-base pre-training dataset and save it to --save_path.

      python make_dataset.py \
      --for_type=for_gen \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_2154
    • Pre-train TLM-base. Adjust parameters such as batch_size in run_train_clm.py according to the GPU memory size. This step requires tmux (apt install tmux); see the launch sketch below.

      python run_train_clm.py
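
      A minimal launch sketch (the tmux session name is an arbitrary placeholder; wrapping the run in tmux is just one way to keep a long pre-training job alive across ssh disconnects):

      sudo apt install tmux                                # required for this step (see the note above)
      tmux new -s tlm_pretrain 'python run_train_clm.py'   # run pre-training in a detachable session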
  2. Train the TLM using iterative optimization. We provide two methods: a one-click script and step-by-step commands.

    A. One-click script. The script uses a pipeline system and requires two machines; both machines need a clone of the TLM repository. The two machines communicate and exchange data via ssh and rsync.

    • On the training machine: 1) configure passwordless ssh login and add the target machine to ~/.ssh/config (see the sketch after the command below); 2) set device_id_all in run.py to specify the GPU IDs that can be used to train TLM; 3) set ssh_target in run.py.

      python run.py --target=nvidia/nvidia-v100 --for_type=for_finetuning --finetuning_init=True
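
      An illustrative sketch of the passwordless-login setup mentioned above (the host alias, address, user name, and key path are placeholders; ssh_target in run.py then refers to the alias configured here):

      ssh-keygen -t ed25519              # skip if a key pair already exists
      ssh-copy-id measure@192.0.2.10     # install the public key on the measurement machine
      # then append an entry like the following to ~/.ssh/config:
      # Host measure-machine
      #     HostName 192.0.2.10
      #     User measure
      #     IdentityFile ~/.ssh/id_ed25519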
    • On the measurement machine, configure available_ids to specify the GPU IDs that can be used for measurement.

      python run_measure.py

    B. Step-by-step commands.

    • Before using TLM-base/TLM to generate tensor programs, generate prompts first.

      python make_dataset.py \
      --for_type=for_gen_train_sketch \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_train \
      --keep_cnt=48 \
      --test_file_idx=0
    • Use TLM-base/TLM to generate tensor programs; --model_name_or_path specifies whether TLM-base or TLM is used.

      CUDA_VISIBLE_DEVICES=0,1,2,3 python gen_state.py \
      --target=nvidia/nvidia-v100 \
      --model_name_or_path=gen_data/clm_gen_v100/checkpoint-24000 \
      --sketch_path=gen_data/v100_gen_train/0_merge.json \
      --save_path=gen_data/v100_gen_train/gen_train.json \
      --allow_repeat=True \
      --keep_cnt=16
    • Measure the execution latency of the generated tensor programs and manually add the path of the measurement results to the utils.json file. The initial measurement data of the iterative optimization contains many errors; this is normal, and the errors gradually decrease as the iterations proceed.

      CUDA_VISIBLE_DEVICES=3 python measure_programs.py --batch-size=64 --target=nvidia/nvidia-v100 --to-measure-path=gen_data/v100_gen_train/gen_train.json --measured-path=gen_data/measure_data_v100/finetuning_0.json
    • Organize the measured programs into dataset/measure_records/v100.

      python postprocess.py --target=nvidia/nvidia-v100
    • Build an SFT dataset.

      python make_dataset.py \
      --for_type=for_gen_best \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/measure_records/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_best
    • SFT TLM-base.

      python run_train_clm_best_v100.py
  3. Evaluate on the target workload.

    • Generate prompts.

      python make_dataset.py \
      --for_type=for_gen_eval_sketch \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_eval \
      --keep_cnt=64
    • Generate tensor programs.

      CUDA_VISIBLE_DEVICES=4 python gen_state.py \
      --model_name_or_path=gen_data/clm_gen_best_v100 \
      --sketch_path=gen_data/v100_gen_eval/0_merge.json \
      --save_path=gen_data/v100_gen_eval/gen_eval.json \
      --allow_repeat=True \
      --target=nvidia/nvidia-v100 \
      --keep_cnt=32
    • Measure the execution latency of the generated tensor programs.

      CUDA_VISIBLE_DEVICES=3 python measure_programs.py --batch-size=64 --target=nvidia/nvidia-v100 --to-measure-path=gen_data/v100_gen_eval/gen_eval.json --measured-path=gen_data/measure_data_v100/0_test_3.json
    • Use the speedup_eval.py script to analyze the speedups.

      python speedup_eval.py --target=nvidia/nvidia-v100 --for_test=True
  4. When the tuning budget is ample, we continue to optimize TLM using data from the target workload. Again, there are two methods.

    A. One-click script.

    • On the training machine.

      python run.py --target=nvidia/nvidia-v100 --for_type=for_testtuning --testtuning_init=True
    • On the measurement machine.

      python run_measure.py

    B. Step-by-step commands.

    • Generate prompts.

      python make_dataset.py \
      --for_type=for_gen_evaltuning_sketch \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/to_measure_programs/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_evaltuning \
      --keep_cnt=64
    • Generate tensor programs.

      CUDA_VISIBLE_DEVICES=0,1,2,3 python gen_state.py \
      --model_name_or_path=gen_data/clm_gen_best_v100 \
      --sketch_path=gen_data/v100_gen_evaltuning/0_merge.json \
      --save_path=gen_data/v100_gen_evaltuning/gen_eval.json \
      --allow_repeat=True \
      --target=nvidia/nvidia-v100 \
      --keep_cnt=32
    • Measure the execution latency of the generated tensor programs and manually add the path of the measurement results to the utils.json file.

      CUDA_VISIBLE_DEVICES=3 python measure_programs.py --batch-size=64 --target=nvidia/nvidia-v100 --to-measure-path=gen_data/v100_gen_evaltuning/gen_eval.json --measured-path=gen_data/measure_data_v100/testtuning_0.json
    • Organize the measured programs into dataset/measure_records/v100.

      python postprocess.py --target=nvidia/nvidia-v100
    • Build an SFT dataset.

      python make_dataset.py \
      --for_type=for_gen_best_all \
      --target=nvidia/nvidia-v100 \
      --dataset_path=dataset/measure_records/v100 \
      --tokenizer_path=gen_data/gen_tokenizer_v100 \
      --save_path=gen_data/v100_gen_best
    • SFT TLM-base.

      python run_train_clm_best_v100.py
    • Not every task has the same optimization space. We use the task scheduler to allocate the tuning budget.

      python task_sheduler.py --target=nvidia/nvidia-v100 --for_testtuning=True

TLM-Meta

cd meta

The workflow is similar to TLM-Ansor; the corresponding commands can be found in run.sh and run.py.

License

TLM is licensed under the Apache-2.0 license.