After some weeks of research on wikitext parsers, it looks like the Java versions of most of them are deprecated or unmaintained. Also, writing our own parser with a generator (e.g., ANTLR) by providing our own lexer/grammar is a no-go, as wikitext syntax has the same problem as Markdown: it cannot be properly described by a formal grammar.
Nevertheless, most of the Wikipedia/wikitext tooling out there uses Sweble anyway, with some workarounds. For example, dkpro-jwpl creates a shaded artifact that substitutes the javax namespace with jakarta in order to use a different version.
After some analysis of Sweble for our use-case, it looks like we shouldn't use the full engine anyway, but just the low-level parser (a PR with the PoC and/or branch is coming soon). This means that we might not have the jakarta problem, and/or we just need to make sure it doesn't pollute our dependencies.
Thus, the idea is the following:
- Use Sweble to parse the wikitext returned by the API calls to Yugipedia.
- Create visitors to simplify the extraction, cleanup and overall processing of the wikitext. This should help to:
  - Extract and use other templates without much more effort (e.g., see #104).
  - Explore non-template information to be included in the models (e.g., currently on Wikipedia, the content of boosters and their distribution lives in a wikitext section instead of the infobox template).
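To make the visitor idea concrete, here is a minimal sketch in plain Java of how a visitor could walk a parsed wikitext tree and collect templates. Note that the node types (`Node`, `Text`, `Template`) are hypothetical stand-ins for illustration only; in practice the AST classes and visitor base class would come from Sweble:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini-AST standing in for Sweble's node classes.
interface Node { void accept(Visitor v); }

record Text(String value) implements Node {
    public void accept(Visitor v) { v.visit(this); }
}

record Template(String name, List<Node> children) implements Node {
    public void accept(Visitor v) { v.visit(this); }
}

interface Visitor {
    void visit(Text text);
    void visit(Template template);
}

// A visitor that collects the names of all templates in the tree,
// i.e., the kind of extraction step described above.
class TemplateCollector implements Visitor {
    final List<String> names = new ArrayList<>();

    public void visit(Text text) { /* nothing to extract from plain text */ }

    public void visit(Template template) {
        names.add(template.name());
        for (Node child : template.children()) {
            child.accept(this); // recurse into nested templates
        }
    }
}

public class VisitorSketch {
    public static void main(String[] args) {
        Node page = new Template("CardTable",
                List.of(new Text("..."), new Template("Unofficial name", List.of())));
        TemplateCollector collector = new TemplateCollector();
        page.accept(collector);
        System.out.println(collector.names); // [CardTable, Unofficial name]
    }
}
```

The point of the pattern is that each new extraction (templates, sections, plain text) becomes a new visitor rather than another ad-hoc regex pass.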
Currently, the following use-cases must be migrated to a parser to simplify the code (making sure that the expectations of the unit/approval tests are still met):
- [ ] Extract a template from wikitext and convert it into a map of arguments (see WikitextTemplateMapper).
- [ ] Parse and clean up some common wikitext markup (see MarkupStringMapper and MarkupString). Note that in this case the markup might also be mixed with some Yugipedia-specific syntax (e.g., some comma- or semicolon-separated fields).
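As a rough illustration of the first checklist item, here is a minimal, hypothetical sketch of turning a template invocation into a map of arguments. The template name and fields are made up, and the real `WikitextTemplateMapper` has to handle far more (nested templates, pipes inside links, unnamed parameters, etc.):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified version of the template-to-map use-case:
// split the inner part of a {{Template|k1=v1|k2=v2}} call into a map.
// It does NOT handle nested templates, links containing pipes, or
// unnamed positional arguments.
public class TemplateToMap {

    public static Map<String, String> toArgumentMap(String template) {
        // strip the surrounding {{ ... }}
        String inner = template.trim()
                .replaceAll("^\\{\\{", "")
                .replaceAll("\\}\\}$", "");
        Map<String, String> args = new LinkedHashMap<>();
        String[] parts = inner.split("\\|");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the template name
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2) {
                args.put(kv[0].trim(), kv[1].trim());
            }
        }
        return args;
    }

    public static void main(String[] args) {
        String wikitext = "{{CardTable | name = Dark Magician | attribute = DARK }}";
        System.out.println(toArgumentMap(wikitext));
        // {name=Dark Magician, attribute=DARK}
    }
}
```

A parser-based version would replace the naive splitting with a walk over the parsed template node, which is exactly where the edge cases stop hurting.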
Discarded: this was already refactored in #122 as part of #120. We will stick with the regex approach, as it is simpler to implement and debug, and the Sweble parser is outdated and not maintained.
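For reference, the regex approach that was kept can be sketched as below. This is a minimal, hypothetical example covering only two common pieces of markup (quote-based bold/italics and internal links); the actual cleanup code handles more cases:

```java
import java.util.regex.Pattern;

// Minimal, hypothetical sketch of regex-based markup cleanup: strip
// '''bold'''/''italic'' quotes and reduce internal links such as
// [[Page|label]] (or [[Page]]) to their visible text.
public class MarkupCleanup {

    private static final Pattern QUOTES = Pattern.compile("'{2,3}");
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]");

    public static String clean(String wikitext) {
        String noLinks = LINK.matcher(wikitext).replaceAll("$1");
        return QUOTES.matcher(noLinks).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("'''Dark Magician''' is a [[Normal Monster|normal]] card."));
        // Dark Magician is a normal card.
    }
}
```

Each rule is an independent, easily testable regex, which is what makes this simpler to implement and debug than wiring in a full parser.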
syntax (see [MarkupString](https://github.com/ygojson/ygojson-tools/blob/main/tools/src/main/java/io/github/ygojson/tools/dataprovider/impl/yugipedia/model/wikitext/MarkupString.java) as before).