yandex-research / DeDLOC

Official code for "Distributed Deep Learning in Open Collaborations" (NeurIPS 2021)
https://arxiv.org/abs/2106.10207
Apache License 2.0

Problems when trying to run the albert example #4

Closed soodoshll closed 2 years ago

soodoshll commented 2 years ago

Hi! Thank you for this amazing project! I'm trying to reproduce the experimental results from the paper but have encountered some problems:

I'm using Python 3.9 and following the instructions in the README file.

1. Data pre-processing

When I tried to run the command python tokenize_wikitext103.py, it showed an error message like:

Traceback (most recent call last):                                                                                                       
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker                           
    result = (True, func(*args, **kwds))                                                                                                 
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper                     
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 485, in wrapper                     
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/fingerprint.py", line 411, in wrapper                       
    out = func(self, *args, **kwargs)                                                                                                    
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2469, in _map_single                
    batch = apply_function_on_filtered_inputs(                                                                                           
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2357, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)                                                                 
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2052, in decorated                  
    result = f(decorated_item, *args, **kwargs)                                                                                          
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 82, in tokenize_function                                                   
    instances = create_instances_from_document(tokenizer, text, max_seq_length=512)                                                      
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 24, in create_instances_from_document                                      
    segmented_sents = list(nltk.sent_tokenize(document))                                                                                 
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize               
    return tokenizer.tokenize(text)                                                                                                      
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize                      
    return list(self.sentences_from_text(text, realign_boundaries))                                                                      
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text           
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>                    
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize                 
    for sentence in slices:                                                                                                              
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range

2. The API URL does not exist

When I tried to run the GPU trainer, it showed this error message:

Traceback (most recent call last):
  File "/home/su/DeDLOC/albert/run_trainer.py", line 297, in <module>
    main()
  File "/home/su/DeDLOC/albert/run_trainer.py", line 225, in main
    tokenizer = AlbertTokenizerFast.from_pretrained(dataset_args.tokenizer_path, cache_dir=dataset_args.cache_dir)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1654, in from_pretrained
    fast_tokenizer_file = get_fast_tokenizer_file(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
    all_files = get_list_of_files(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/file_utils.py", line 2103, in get_list_of_files
    return list_repo_files(path_or_repo, revision=revision, token=token)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 602, in list_repo_files
    info = self.model_info(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 586, in model_info
    r.raise_for_status()
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer
soodoshll commented 2 years ago

I downgraded nltk to 3.6.2, and that solved the first problem.
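
If downgrading is not an option, another way out is to catch the failure per document so one bad input doesn't kill the whole multiprocess pool. A minimal sketch (the tokenize_sentences helper is hypothetical, not part of tokenize_wikitext103.py):

import nltk

def tokenize_sentences(document: str) -> list:
    # Hedged workaround: some nltk releases raise IndexError inside
    # punkt's _match_potential_end_contexts on certain inputs; skip
    # the offending document instead of crashing the map() worker.
    try:
        return list(nltk.sent_tokenize(document))
    except IndexError:
        return []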

borzunov commented 2 years ago

Hi!

The second problem is a consequence of the first one: this (admittedly obscure) error message is shown when the script can't find the ./data directory, which is the output of the tokenize_wikitext103.py script. Running that script again should help.

The seemingly unrelated requests.exceptions.HTTPError is raised because, if the script fails to find the tokenizer locally, it looks for a model with the provided name online.
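
For clarity, a pre-flight check along these lines would surface the real cause (a sketch only; it assumes tokenizer_path defaults to data/tokenizer, which is what the 404 URL above suggests):

import os

from transformers import AlbertTokenizerFast

tokenizer_path = "data/tokenizer"  # assumed default, inferred from the 404 URL
if not os.path.isdir(tokenizer_path):
    # With no local directory, transformers treats the path as a Hub repo id
    # and queries https://huggingface.co/api/models/data/tokenizer -> 404
    raise FileNotFoundError(
        f"'{tokenizer_path}' not found; run tokenize_wikitext103.py first"
    )
tokenizer = AlbertTokenizerFast.from_pretrained(tokenizer_path)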

Note: If you are verifying the plots/numbers reported in the paper, you're correct to use this repository. In contrast, if your goal is to try out collaborative training (or to set up your own experiment), consider using a newer version of the hivemind library together with the newer version of the ALBERT example from the https://github.com/learning-at-home/hivemind repository. It has many substantial improvements, including a fix for this obscure error message.

soodoshll commented 2 years ago

Thanks, Alexander. That solved the problem.