Closed: silviaegt closed this 1 year ago
Hi @silviaegt, training an ML algorithm requires data, so institutions would also need to provide some or all of the reconciled candidate data (as an opt-in choice for users), not only the OpenRefine operation history.
However, we have talked about a potential future "opt-in" feature where selected reconciliation candidate data could be sent back to a service and used in feedback loops by anyone interested in improving reconciliation. We are discussing such use cases in the Reconciliation API specs (https://github.com/reconciliation-api/specs/issues/30). So it's a two-part story: 1. the reconciliation service asks the user for permission to use their reconciliation data; 2. the user gives consent to the service, and tools like OpenRefine send the reconciled candidate data along with the data from other columns that the user relied on to make a decision. With that extra data, classification systems (lazy learners, eager learners) could begin to fit models for various ML algorithms.
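As a rough illustration of the two-part flow described above, here is a minimal Python sketch of a consent-gated feedback record. Everything in it is hypothetical: the `ReconFeedback` class, its field names, and `build_feedback_payload` are invented for this example and are not taken from the Reconciliation API specs or from OpenRefine.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class ReconFeedback:
    """Hypothetical record of one reconciliation decision (not a spec'd format)."""
    query: str                      # the cell value that was reconciled
    chosen_candidate_id: str        # the candidate the user picked
    candidate_ids: List[str]        # all candidates the service offered
    # Values from other columns that informed the decision (assumed shape).
    context_columns: Dict[str, str] = field(default_factory=dict)


def build_feedback_payload(
    feedback: ReconFeedback,
    service_requested: bool,
    user_consented: bool,
) -> Optional[dict]:
    """Return a payload only if the service asked AND the user opted in."""
    if not (service_requested and user_consented):
        return None  # nothing leaves the client without both parts
    return asdict(feedback)


example = ReconFeedback(
    query="Southern Illinois University at Carbondale",
    chosen_candidate_id="Q_PLACEHOLDER_1",  # placeholder, not a real QID
    candidate_ids=["Q_PLACEHOLDER_1", "Q_PLACEHOLDER_2"],
    context_columns={"state": "Illinois"},
)
```

The point of the sketch is only that both conditions gate the data: without the service's request and the user's consent, no payload is produced.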
Hey @thadguidry,
so glad to hear this is already something you're thinking about!
No doubt full reconciled candidate data would be great. I was just thinking that having tons of files with reconciliation operations could already be useful, for example records showing that "Rutgers University, New Brunswick" was manually matched to "Rutgers University–New Brunswick", or that "Southern Illinois University at Carbondale" was matched to "Southern Illinois University Carbondale".
I know lots of librarians have their operation history in their OpenRefine projects and an open call to collect those might not be that difficult? But hey, the important thing is that it is relevant for you guys 😄
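To make the idea of collecting manual matches from operation histories more concrete, here is a small Python sketch that pulls (cell value, matched ID, matched name) triples out of an exported history. It assumes the export is a JSON array in which manual judgments appear as `core/recon-judge-similar-cells` operations with `similarValue`, `judgment`, and `match` fields; those names should be checked against a real export. The function name and the sample data (including the placeholder ID) are made up for illustration.

```python
import json

# Assumed operation ID for "match all similar cells" judgments in an export.
JUDGE_OP = "core/recon-judge-similar-cells"


def extract_match_pairs(history_json):
    """Return (cell value, matched id, matched name) for each manual match."""
    pairs = []
    for op in json.loads(history_json):
        # Skip unrelated operations and judgments that are not matches.
        if op.get("op") != JUDGE_OP or op.get("judgment") != "matched":
            continue
        match = op.get("match") or {}
        pairs.append((op.get("similarValue"), match.get("id"), match.get("name")))
    return pairs


# Illustrative history; the ID is a placeholder, not a real QID.
sample_history = json.dumps([
    {
        "op": "core/recon-judge-similar-cells",
        "columnName": "university",
        "similarValue": "Rutgers University, New Brunswick",
        "judgment": "matched",
        "match": {"id": "Q_PLACEHOLDER_1", "name": "Rutgers University–New Brunswick"},
    },
    {"op": "core/column-rename", "oldColumnName": "a", "newColumnName": "b"},
])

print(extract_match_pairs(sample_history))
```

Note that pairs like these record only the final decision, not the rejected candidates or the other columns that informed it.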
Hi @silviaegt,
Yes, I totally agree with you that it would make sense to make better use of the annotation effort. Making it easier to pool and share that effort would be very useful. Using the OpenRefine operation history could be one way, but I think it should be possible to come up with something more practical. I don't have tons of time to expand on this but it's on my radar :)
It's a nice idea but it's not within scope for this service - it would be more on OpenRefine's side.
Hey @wetneb,
First of all, thank you so much for this life-changing endpoint!
I use it so much I thought maybe, just maybe something like this could be useful:
Have you thought about creating a collective repository where GLAM institutions and others can dump their operation history (e.g. this exercise in which I matched US universities), and then training an ML algorithm or the like to get better matching results, based on the probability that a certain string matches a given QID in certain data contexts?
I think it would be nice to build collectively on work that has taken so many hours of sitting....
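One very simple reading of "the probability that a certain string matches a given QID" is a frequency count over pooled manual matches. The Python sketch below shows that baseline only; it is not a trained ML model and not something the service implements. The function name, the normalization step, and the placeholder IDs are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def match_probabilities(pairs):
    """Estimate P(qid | string) as relative frequency over pooled matches.

    `pairs` is an iterable of (cell value, matched id) tuples, for example
    gathered from many contributors' operation histories.
    """
    counts = defaultdict(Counter)
    for value, qid in pairs:
        # Light normalization so trivially different spellings pool together.
        counts[value.strip().casefold()][qid] += 1
    return {
        value: {qid: n / sum(by_qid.values()) for qid, n in by_qid.items()}
        for value, by_qid in counts.items()
    }


# Illustrative pooled data; the IDs are placeholders, not real QIDs.
pooled = [
    ("Rutgers University, New Brunswick", "Q_PLACEHOLDER_1"),
    ("rutgers university, new brunswick", "Q_PLACEHOLDER_1"),
    ("Rutgers University, New Brunswick ", "Q_PLACEHOLDER_1"),
    ("Rutgers University, New Brunswick", "Q_PLACEHOLDER_2"),
]

probs = match_probabilities(pooled)
```

A baseline like this ignores the "data contexts" mentioned above; using the other columns would need the richer candidate data discussed elsewhere in the thread.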