This is partly a bug and partly a feature. I discovered it when I ran the tool on a subset of Gates publications from DCP, specifically only 4 publications.
The fuzzy matcher builds a vocabulary from the publications it is fed and uses it to calculate the cosine similarity with the references found in the policy documents. The implication is that the vocabulary depends on the variety of words used in the input publications and is not stable. For example, feeding in 100K Wellcome publications creates a large enough vocabulary, but when only a handful of publications are available the vocabulary is very small and, as a result, the tool finds noisy matches.
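To illustrate the mechanics, here is a minimal sketch assuming a TF-IDF style pipeline with scikit-learn; the vectorizer choice and the titles are illustrative, not our exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

publications = [
    "The evolution of Malaria",
    "Tuberculosis treatment outcomes in children",
    # with ~100K Wellcome titles the vocabulary is large and stable;
    # with only 4 publications it is tiny
]

# Step 1: the vocabulary is learned only from the input publications.
vectorizer = TfidfVectorizer()
pub_vectors = vectorizer.fit_transform(publications)
print(len(vectorizer.vocabulary_))  # vocabulary size tracks the corpus

# Step 2: policy references are projected onto that same vocabulary;
# any word the publications never used is silently dropped here.
references = ["Malaria had an evolution that is hard to understand"]
ref_vectors = vectorizer.transform(references)

# Step 3: cosine similarity against every publication decides the match.
print(cosine_similarity(ref_vectors, pub_vectors))
```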
The noise is partly the result of the changing vocabulary and partly the result of the similarity threshold not being optimised for such a small vocabulary.
I propose we revisit the fuzzy matcher to address this shortcoming. One solution I am considering is to use a fixed vocabulary in addition to the vocabulary generated by the publications we feed in. Another way to mitigate the problem is to filter the matches by calculating the Levenshtein distance after a match has been made, which is far more efficient than using Levenshtein distance for the matching itself; a sketch of such a post-filter follows.
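Here is a rough sketch of what that post-filter could look like, assuming the python-Levenshtein package (any edit-distance implementation would do); the function name and threshold are illustrative:

```python
import Levenshtein

def filter_matches(matches, max_distance_ratio=0.5):
    """Drop candidate (title, reference, score) pairs whose normalised
    edit distance is too large despite a high cosine score.

    This only runs on the few candidate pairs the cosine step returns,
    which is much cheaper than matching everything with edit distance.
    """
    kept = []
    for title, reference, score in matches:
        distance = Levenshtein.distance(title.lower(), reference.lower())
        if distance / max(len(title), len(reference)) <= max_distance_ratio:
            kept.append((title, reference, score))
    return kept
```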
To give a more concrete example of the problem, imagine a publication title "The evolution of Malaria" and a reference with the title "Malaria had an evolution that is hard to understand and of great significance". Most of the words from the first title are present in the second, while most words from the second title are not in the vocabulary and are therefore dropped, so the reference vector collapses onto the title's words and the cosine similarity is very high.
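Under the same TF-IDF assumption as above, this can be reproduced in a few lines; with a vocabulary fitted on the short title alone, the similarity comes out around 0.87 with default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

title = ["The evolution of Malaria"]
reference = ["Malaria had an evolution that is hard to understand "
             "and of great significance"]

# Vocabulary contains only the four words of the short title.
vectorizer = TfidfVectorizer().fit(title)
vectors = vectorizer.transform(title + reference)

# The reference keeps only its in-vocabulary words (malaria,
# evolution, of), so the similarity is misleadingly high.
print(cosine_similarity(vectors[0:1], vectors[1:2]))  # ~0.87
```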