Refactoring the repository

eubinecto commented 3 years ago

0. Download the data and define the path

For example, like this:
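A minimal sketch of what such path definitions might look like. Only the data/corpora directory and the name SC_DIR appear in this thread; PROJECT_ROOT, DATA_DIR, CORPORA_DIR and the "sc" sub-directory name are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical layout: only `data/corpora` and the name `SC_DIR` come from the thread.
PROJECT_ROOT = Path.cwd()
DATA_DIR = PROJECT_ROOT / "data"
CORPORA_DIR = DATA_DIR / "corpora"

# The sub-directory name "sc" is an assumption; point this at wherever
# the downloaded emotional dialogue corpus actually lives.
SC_DIR = CORPORA_DIR / "sc"
```

Path objects work with os.path.join, so a constant defined this way can be passed to code that builds file paths with os.path.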

1. defining the indices

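Define the following Document class in storyteller/elastic/docs.py. There are three things to implement:

Any fields needed in addition to sents
stream_from_corpus(): parses the corpus data and streams instances of that Doc.
The Index meta class.

For example, the emotional dialogue corpus (감성대화) is defined as follows: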

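import json
import os
from typing import Generator

from elasticsearch_dsl import Keyword

# Story (the base Document class) and SC_DIR (the corpus directory) are
# assumed to be defined elsewhere in the storyteller package.


class SC(Story):
    """
    Index for the emotional dialogue (감성대화) corpus
    """
    # --- additional fields for SC --- #
    profile_id = Keyword()
    talk_id = Keyword()

    @staticmethod
    def stream_from_corpus() -> Generator['SC', None, None]:
        train_json_path = os.path.join(SC_DIR, "Training", "감성대화말뭉치(최종데이터)_Training.json")
        val_json_path = os.path.join(SC_DIR, "Validation", "감성대화말뭉치(최종데이터)_Validation.json")

        for json_path in (train_json_path, val_json_path):
            # the corpus is Korean text, so read it explicitly as UTF-8
            with open(json_path, 'r', encoding='utf-8') as fh:
                corpus_json = json.load(fh)
            for sample in corpus_json:
                # join every utterance of one talk into a single string
                yield SC(sents=" ".join(sample['talk']['content'].values()),
                         profile_id=sample['talk']['id']['profile-id'],
                         talk_id=sample['talk']['id']['talk-id'])

    class Index:
        # name of the index for this corpus
        name = "sc_story"
        settings = Story.settings()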
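To make the parsing step easier to check without Elasticsearch, here is a standalone sketch of the same field extraction on an in-memory sample. The function name parse_corpus, the sample ids and the utterance keys ("HS01", "SS01") are made up for illustration; only the nesting talk -> id / content follows the class above.

```python
import json
from typing import Dict, Iterator, List


def parse_corpus(corpus_json: List[dict]) -> Iterator[Dict[str, str]]:
    """Yield one flat record per talk, mirroring the fields of SC."""
    for sample in corpus_json:
        talk = sample["talk"]
        yield {
            # all utterances of the talk, joined in their original order
            "sents": " ".join(talk["content"].values()),
            "profile_id": talk["id"]["profile-id"],
            "talk_id": talk["id"]["talk-id"],
        }


# Hypothetical sample; real corpus entries carry more keys than this.
SAMPLE = json.loads("""
[{"talk": {"id": {"profile-id": "P1", "talk-id": "T1"},
           "content": {"HS01": "first sentence.", "SS01": "second sentence."}}}]
""")

docs = list(parse_corpus(SAMPLE))
```

Keeping the extraction in a plain function like this would let it be unit-tested before wiring it into stream_from_corpus().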

2. index

Once step 1 is done, indexing can be run right away with the following commands.

python3 -m storyteller.main.index --index=gk_story  # index the general knowledge corpus
python3 -m storyteller.main.index --index=sc_story  # index the emotional dialogue corpus
python3 -m storyteller.main.index --index=mr_story  # index the machine dialogue corpus
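One way the --index flag could be resolved to a Document class is a small name registry. This is only a sketch: the class names GK and MR are assumptions (only SC is shown in this thread), and strings stand in for the classes so the snippet runs on its own.

```python
import argparse
from typing import Dict, List, Optional

# Index name -> Document class. Strings are placeholders for the real classes;
# "GK" and "MR" are assumed names, only "SC" appears in the thread.
INDICES: Dict[str, str] = {
    "gk_story": "GK",
    "sc_story": "SC",
    "mr_story": "MR",
}


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # restricting choices makes an unknown index name fail early
    parser.add_argument("--index", required=True, choices=sorted(INDICES))
    return parser.parse_args(argv)


args = parse_args(["--index=sc_story"])
doc_name = INDICES[args.index]
```

With the real classes in the registry, the script would then call stream_from_corpus() on the selected class and send the resulting documents to Elasticsearch.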

3. search

The exact search logic here is still unfinished, but rough searching already works. For more accurate results, the Searcher class defined in storyteller/elastic/searcher.py needs to be modified.

python3 -m storyteller.main.search --wisdom="산 넘어 산"
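As a starting point for tightening the search, a phrase query against the sents field might look like the sketch below. This is not the current Searcher implementation; build_query is a hypothetical helper, and choosing match_phrase (exact phrase match) is an assumption about what "more accurate" should mean here.

```python
from typing import Any, Dict


def build_query(wisdom: str, size: int = 10) -> Dict[str, Any]:
    """Build an Elasticsearch request body that looks for the proverb as a phrase."""
    return {
        "size": size,
        # match_phrase requires the terms to appear together and in order,
        # which is stricter than a plain match query
        "query": {"match_phrase": {"sents": wisdom}},
    }


body = build_query("산 넘어 산")
```

A body like this can be passed to a search call on any of the story indices; how strict the matching should be is still an open design question in this thread.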

4. build

storyteller manages the following wandb artifacts:

wisdoms
wisdomify_test
wisdom2def
wisdom2eg

Before uploading an artifact to wandb, first build each artifact's files and directories locally as follows:

data
├── corpora
└── wandb
    ├── artifacts
    │   ├── wisdom2def
    │   │   ├── wisdom2def.tsv
    │   │   ├── wisdom2def_raw.tsv
    │   │   ├── wisdom2def_train.tsv
    │   │   └── wisdom2def_val.tsv
    │   ├── wisdom2eg
    │   │   ├── wisdom2eg.tsv
    │   │   ├── wisdom2eg_raw.tsv
    │   │   ├── wisdom2eg_train.tsv
    │   │   └── wisdom2eg_val.tsv
    │   ├── wisdomify_test.tsv
    │   └── wisdoms.txt

The scripts for this are as follows:

python3 -m storyteller.main.build --artifact_name="wisdoms"
python3 -m storyteller.main.build --artifact_name="wisdomify_test"
python3 -m storyteller.main.build --artifact_name="wisdom2def"
python3 -m storyteller.main.build --artifact_name="wisdom2eg"
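The _train.tsv and _val.tsv files in the tree above suggest that the build step splits each table. A rough sketch of such a split follows; the 80/20 ratio, the fixed seed and the helper names are assumptions, not the actual build logic.

```python
import csv
import io
import random
from typing import List, Sequence, Tuple

Row = Tuple[str, str]


def split_rows(rows: Sequence[Row], val_ratio: float = 0.2,
               seed: int = 42) -> Tuple[List[Row], List[Row]]:
    """Shuffle deterministically, then cut off the first val_ratio share as validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


def to_tsv(rows: Sequence[Row]) -> str:
    """Serialise rows as tab-separated text, one row per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical (wisdom, definition) pairs standing in for wisdom2def.tsv
rows = [(f"wisdom_{i}", f"definition_{i}") for i in range(10)]
train, val = split_rows(rows)
val_tsv = to_tsv(val)
```

Fixing the seed keeps the split reproducible across builds, so an uploaded artifact version can be regenerated later.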

5. upload

Once the build is finished, the artifacts can be uploaded to wandb with the following scripts.

python3 -m storyteller.main.upload --artifact_name="wisdoms"
python3 -m storyteller.main.upload --artifact_name="wisdomify_test"
python3 -m storyteller.main.upload --artifact_name="wisdom2def"
python3 -m storyteller.main.upload --artifact_name="wisdom2eg"
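For orientation, an upload along these lines could be sketched with the wandb artifact API. The local paths follow the directory tree shown in step 4; the project name "storyteller" and the artifact type "dataset" are assumptions, and upload() is a hypothetical helper rather than the script's actual code.

```python
from pathlib import Path

ARTIFACTS_DIR = Path("data") / "wandb" / "artifacts"

# artifact name -> file or directory under ARTIFACTS_DIR (taken from the tree in step 4)
LOCAL_NAMES = {
    "wisdoms": "wisdoms.txt",
    "wisdomify_test": "wisdomify_test.tsv",
    "wisdom2def": "wisdom2def",
    "wisdom2eg": "wisdom2eg",
}


def artifact_path(name: str) -> Path:
    return ARTIFACTS_DIR / LOCAL_NAMES[name]


def upload(name: str, project: str = "storyteller") -> None:
    """Log one locally built artifact to wandb (project name is an assumption)."""
    import wandb  # imported lazily so the path helpers work without wandb installed

    path = artifact_path(name)
    run = wandb.init(project=project, job_type="upload")
    artifact = wandb.Artifact(name, type="dataset")
    if path.is_dir():
        artifact.add_dir(str(path))   # wisdom2def, wisdom2eg
    else:
        artifact.add_file(str(path))  # wisdoms.txt, wisdomify_test.tsv
    run.log_artifact(artifact)
    run.finish()
```

Calling upload("wisdom2def") would require a wandb login and a completed local build.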

Download??

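Downloading the corpora is on hold for now. I have also removed the GCP downloader for the time being. That does not mean it is unnecessary; it is simply unusable at the moment, so I removed it temporarily. The download logic will probably need to be added back separately later.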

eubinecto commented 3 years ago

scikit-learn cannot be installed on M1... For now, let's leave it out and install only what we need.

eubinecto commented 2 years ago

Laying out the structure for the refactoring:

Core features needed:

download
parse
index
search

The goal is to implement these first.
