zerospeech / benchmarks

A command line tool that helps use the "Zero Resource Challenge" benchmarks
https://zerospeech.com/toolbox/
GNU General Public License v3.0

Remove abxLS replace it with abxLSPhon #12

Closed. nhamilakis closed this issue 1 year ago.

nhamilakis commented 1 year ago

Following the updates we did with Mark, I think the old abxLS module can be completely removed and replaced with the abxLSPhon one, since it can do the same evaluation as the old one.

The type of evaluation is decided by the context parameter.

Am I correct in my assumption? I would like your feedback @ewan @hallapmark. Also, as Ewan said, maybe the benchmark name is not correct any more; should I change it to something else?

hallapmark commented 1 year ago

Not exactly, but I am also not sure which level of abstraction we are talking about. I think it might help to distinguish between the labels and configurations used behind the scenes in the abx2 code and the labels that are visible to the user of Benchmarks.

Publicly visible / Benchmark-user-facing labels (a proposal; Ewan is probably the stakeholder to say what the best names are) and the corresponding abx2 configurations:

Context:

All of these will run speaker_mode=all, i.e. there are separate within-speaker and across-speaker scores for all of these options.

Explanation: I think the source of the taxonomic difficulty is that, from the abx code's standpoint, the on-triphone evaluation is also "within context"; it is just that what gets extracted are triphones with the context included, i.e. the preceding and following phoneme (abc, where a and c are the fixed context). In the within-context on-phoneme evaluation, a and c are still the fixed context, but only b is extracted. The difference between the two "within" context options comes down solely to a difference in the timestamps in the item file. Literally, the old abx1 code could run on-phoneme within context and on-triphone within context, if you made two calls with the corresponding item files. For the any-context condition, on the other hand, code changes were needed.
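To make the timestamp point concrete, here is a hedged sketch (the file name, times, and exact column layout are illustrative, loosely modelled on Libri-light-style item files, and are not taken from the actual data):

```python
# Illustrative only: assumed column layout is
#   #file onset offset #item prev-phone next-phone speaker

# on-triphone, within context: onset/offset delimit the whole a-b-c span;
# a and c double as the fixed context used for grouping
triphone_item_row = "1272-128104-0000 1.40 1.73 a-b-c a c 1272"

# on-phoneme, within context: onset/offset delimit only the central phone b;
# a and c again appear as the fixed context columns
phoneme_item_row = "1272-128104-0000 1.51 1.62 b a c 1272"
```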

I do not know how best to add a further layer of categorization and give the Benchmark user the option to run several of these at the same time. One option would be:

Or maybe just:

context: all

ewan commented 1 year ago

Yes, the two modules are functionally the same. The general name can be abxLS rather than abxLSPhon, even though we use the updated module.

Indeed, as @hallapmark indicates, there are two (mostly) independent dimensions, in addition to the speaker dimension, which means that the names all and any would be confusing. The first dimension is the items, which are either triphones or phones. The second is the context dimension, which can be either within or any (we sometimes also called this option whatever context, but any is the one we have been using). However, as you imply, it doesn't make much sense to run the triphone item file with the any option, so we can do as Mark says and provide four options:
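A minimal sketch of how those four options map onto the two dimensions, using the labels listed later in this thread (triphone-within, phoneme-within, phoneme-any, all); the key and field names here are illustrative, not the benchmark's actual API:

```python
# Illustrative mapping only; field names are assumptions, not the real API.
CONTEXT_OPTIONS = {
    "triphone-within": {"item_set": "triphone", "context_mode": "within"},
    "phoneme-within":  {"item_set": "phoneme",  "context_mode": "within"},
    "phoneme-any":     {"item_set": "phoneme",  "context_mode": "any"},
}
# "triphone-any" is deliberately absent, and "all" simply runs every
# configuration above; speaker_mode="all" likewise scores both within- and
# across-speaker for each configuration.
```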

nhamilakis commented 1 year ago

I think I understand better now; I will modify my code accordingly.

nhamilakis commented 1 year ago

After running the evaluation with context_mode="all" and speaker_mode="all", the output in the Benchmark module looks something like this:

| subset | speaker_mode | context_mode | granularity | score | item_file | pooling | seed |
|---|---|---|---|---|---|---|---|
| dev-clean | within | within | triphone | 0.4932 | triphone-dev-clean.item | none | 3459 |
| dev-clean | across | within | triphone | 0.4981 | triphone-dev-clean.item | none | 3459 |
| dev-other | within | within | triphone | 0.4985 | triphone-dev-other.item | none | 3459 |
| dev-other | across | within | triphone | 0.4998 | triphone-dev-other.item | none | 3459 |
| test-clean | within | within | triphone | 0.4970 | triphone-test-clean.item | none | 3459 |
| test-clean | across | within | triphone | 0.4996 | triphone-test-clean.item | none | 3459 |
| test-other | within | within | triphone | 0.4926 | triphone-test-other.item | none | 3459 |
| test-other | across | within | triphone | 0.5010 | triphone-test-other.item | none | 3459 |
| dev-clean | within | within | phoneme | 0.5001 | phoneme-dev-clean.item | none | 3459 |
| dev-clean | across | within | phoneme | 0.4983 | phoneme-dev-clean.item | none | 3459 |
| dev-other | within | within | phoneme | 0.5136 | phoneme-dev-other.item | none | 3459 |
| dev-other | across | within | phoneme | 0.4994 | phoneme-dev-other.item | none | 3459 |
| test-clean | within | within | phoneme | 0.5178 | phoneme-test-clean.item | none | 3459 |
| test-clean | across | within | phoneme | 0.4979 | phoneme-test-clean.item | none | 3459 |
| test-other | within | within | phoneme | 0.5000 | phoneme-test-other.item | none | 3459 |
| test-other | across | within | phoneme | 0.4990 | phoneme-test-other.item | none | 3459 |
| dev-clean | within | any | phoneme | 0.5005 | phoneme-dev-clean.item | none | 3459 |
| dev-clean | across | any | phoneme | 0.5004 | phoneme-dev-clean.item | none | 3459 |
| dev-other | within | any | phoneme | 0.5044 | phoneme-dev-other.item | none | 3459 |
| dev-other | across | any | phoneme | 0.5002 | phoneme-dev-other.item | none | 3459 |
| test-clean | within | any | phoneme | 0.4978 | phoneme-test-clean.item | none | 3459 |
| test-clean | across | any | phoneme | 0.5000 | phoneme-test-clean.item | none | 3459 |
| test-other | within | any | phoneme | 0.5027 | phoneme-test-other.item | none | 3459 |
| test-other | across | any | phoneme | 0.4989 | phoneme-test-other.item | none | 3459 |

As said before, the options for context are: triphone-within, phoneme-within, phoneme-any, all. The options for speaker mode are the same as before. I also left the possibility for the user to set the seed via the params.yaml file. The item_file and granularity columns are a bit redundant; I'm not sure I will keep them both. Also, this does not mean that all of this information will be in the final leaderboard; we can still choose what to show or not. I kept what I think is useful for the user at the level of the Benchmark module. Most of the other parameters are also in the params.yaml file.
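For reference, a minimal sketch of what the relevant part of such a params.yaml could look like; the key names mirror the output columns above and are assumptions, not the module's documented schema:

```python
# Hedged sketch only: key names mirror the output columns shown above and are
# not guaranteed to match the module's actual params.yaml schema.
import yaml  # requires PyYAML

params = yaml.safe_load("""
context_mode: all     # triphone-within | phoneme-within | phoneme-any | all
speaker_mode: all     # within | across | all
seed: 3459
""")
print(params["context_mode"], params["speaker_mode"], params["seed"])
```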

If you @ewan & @hallapmark agree with this, I will go ahead and merge the branch.

hallapmark commented 1 year ago

Yes, maybe the item_file column is redundant and granularity should be enough, but it's ok either way. Otherwise, this looks good to me!

ewan commented 1 year ago

This looks fine to me. I would keep the granularity column.

