wikitables / web-of-data-tables

Project for tables on the Web

Changes in Wikipedia's tables #3

Closed: emir-munoz closed this issue 6 years ago

emir-munoz commented 6 years ago

Wikipedia has updated the way some tables are stored in wikitext (the MediaWiki markup). It has created more templates that render tables in different ways to make them look nicer. Because of that, it makes more sense to retrieve the tables from the HTML pages rather than from the wikitext dumps.

There are also more parsers available for HTML than for wikitext. However, we would have to crawl the Wikipedia articles, since there is no HTML dump of Wikipedia.

Having said that, I checked the HTML source of the Wikipedia article Liverpool_F.C., where I found the following examples of tables.

A nice example of a table with team players and data about their country and position. This table does not have any class in the HTML source that helps us differentiate it from other tables. [screenshot: selection_049]

A second example of a table, where there is an explicit mention of the class wikitable and of something else called alternance. [screenshot: selection_050]

Finally, another example of a table, where there is an explicit mention of the class wikitable plus some other classes: plainrowheaders sortable jquery-tablesorter. [screenshot: selection_051]

This brings me to the question of what types of tables we will consider in this research. We should do some reading about the different types of tables in Wikipedia and find some examples that we would like to work with.

emir-munoz commented 6 years ago

I forgot to mention that Wikipedia now has separate articles for some relevant lists (shown as tables), for example the List of cities in Italy. One option could be to limit ourselves to only this type of article (page).

aidhog commented 6 years ago

Okay, it seems that things have changed a bit in the last few years.

First of all, this might be an opportunity. Surely we will not be the last people who wish to extract tables from Wikipedia, so if we can create and describe a clean, reusable framework/tool for this, that could be a very nice practical contribution!

For classes of tables, I propose we postpone that issue by simply extracting all tables (remembering the class labels) and later decide if one particular class is just noise or not. More generally, I propose to extract as many tables as possible from as many articles as possible as a first step and worry about filtering them later.
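
To make the extract-everything step concrete, here is a minimal sketch (assuming jsoup as the HTML parser; the live fetch of a single page is just for illustration, since the HTML could equally come from a local dump):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableClassSurvey {
    public static void main(String[] args) throws Exception {
        // Parse the rendered HTML of one article.
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Liverpool_F.C.").get();

        // Extract every <table>, remembering its class attribute so that we
        // can decide later which classes are noise and which are relevant.
        for (Element table : doc.select("table")) {
            String classes = table.attr("class"); // may be empty, as in the squad table
            int rows = table.select("tr").size();
            System.out.printf("classes=[%s] rows=%d%n", classes, rows);
        }
    }
}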

The main question then seems to be how to get a recent dump of Wikipedia in HTML. I would be very reluctant to crawl: with a delay of, say, 1 second between requests (which is not much), 5 million pages take about 5,000,000 seconds, i.e., roughly 58 days, so we would be waiting around two months. Also, there are a lot of parsers out there: https://www.mediawiki.org/wiki/Alternative_parsers. Even if some of the Wikipedia structure has changed, we only need something that can extract the tables. Another possibility might be to write a (potentially quick and dirty) parser ourselves to get the tables directly from the Wikipedia dump. Given the time it would take to crawl, and how difficult it would be to update the data later, I would say we should certainly prefer to use a dump!

emir-munoz commented 6 years ago

For classes of tables, I propose we postpone that issue by simply extracting all tables [...]

I agree with this; we never know what other interesting things we could do with the other tables. For instance, 5 years ago I started to research (n × 2) tables, which are infobox-style. I never finished that, though.

The main question then seems to be how to get a recent dump of Wikipedia in HTML. [...] I would say we should certainly prefer to use a dump!

I also agree with this one. We previously had a parser (in Java) for the XML dump that converted the pages to HTML. However, even at that point they had already started changing the table templates. Since then, they have created more templates, which we would need to track down in order to parse and interpret a table if we go for the dump option. For instance, the following wikitext generates the table with the players in the https://en.wikipedia.org/wiki/Liverpool_F.C. article.

{{Fs start}}
{{Fs player|no= 1|nat=GER|name=[[Loris Karius]]|pos=GK}}
{{Fs player|no= 2|nat=ENG|name=[[Nathaniel Clyne]]|pos=DF}}
{{Fs player|no= 4|nat=NED|name=[[Virgil van Dijk]]|pos=DF}}
{{Fs player|no= 5|nat=NED|name=[[Georginio Wijnaldum]]|pos=MF}}
{{Fs player|no= 6|nat=CRO|name=[[Dejan Lovren]]|pos=DF}}
{{Fs player|no= 7|nat=ENG|name=[[James Milner]]|pos=MF|other=[[Captain (association football)#Vice-captain|vice-captain]]}}
{{Fs player|no= 9|nat=BRA|name=[[Roberto Firmino]]|pos=FW}}
{{Fs player|no=11|nat=EGY|name=[[Mohamed Salah]]|pos=FW}}
{{Fs player|no=12|nat=ENG|name=[[Joe Gomez (footballer)|Joe Gomez]]|pos=DF}}
{{Fs player|no=14|nat=ENG|name=[[Jordan Henderson]]|pos=MF|other=[[Captain (association football)|captain]]}}
{{Fs player|no=17|nat=EST|name=[[Ragnar Klavan]]|pos=DF}}
{{fs player|no=18|nat=ESP|name=[[Alberto Moreno]]|pos=DF}}
{{Fs player|no=19|nat=SEN|name=[[Sadio Mané]]|pos=FW}}
{{Fs mid}}
{{Fs player|no=20|nat=ENG|name=[[Adam Lallana]]|pos=MF}}
{{Fs player|no=21|nat=ENG|name=[[Alex Oxlade-Chamberlain]]|pos=MF}}
{{Fs player|no=22|nat=BEL|name=[[Simon Mignolet]]|pos=GK}}
{{Fs player|no=23|nat=GER|name=[[Emre Can]]|pos=MF}}
{{Fs player|no=26|nat=SCO|name=[[Andrew Robertson (footballer)|Andrew Robertson]]|pos=DF}}
{{Fs player|no=28|nat=ENG|name=[[Danny Ings]]|pos=FW}}
{{Fs player|no=29|nat=ENG|name=[[Dominic Solanke]]|pos=FW}}
{{Fs player|no=32|nat=CMR|name=[[Joël Matip]]|pos=DF}}
{{Fs player|no=34|nat=HUN|name=[[Ádám Bogdán]]|pos=GK}}
{{Fs player|no=52|nat=WAL|name=[[Danny Ward (Welsh footballer)|Danny Ward]]|pos=GK}}
<!--{{Fs player|no=54|nat=ENG|name=[[Sheyi Ojo]]|pos=MF}}-->
<!--{{Fs player|no=56|nat=ENG|name=[[Connor Randall]]|pos=DF}}-->
{{Fs player|no=58|nat=WAL|name=[[Ben Woodburn]]|pos=FW}}
<!--{{Fs player|no=60|nat=BRA|name=[[Allan (footballer, born 1997)|Allan]]|pos=MF}}-->
{{Fs player|no=66|nat=ENG|name=[[Trent Alexander-Arnold]]|pos=DF}}
<!--{{Fs player|no=68|nat=ESP|name=[[Pedro Chirivella]]|pos=MF}}-->
<!--{{Fs player|no=—|nat=NGR|name=[[Taiwo Awoniyi]]|pos=FW}}-->
{{Fs end}}

In this specific parsing case, we would also need the mappings for the country codes (e.g., nat=GER) and position codes (e.g., pos=GK).
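
To illustrate what interpreting these template calls involves, here is a minimal sketch (the class name, the regex, and the hard-coded lookup tables are all assumptions for illustration; in practice the mappings would have to be harvested from the templates themselves):

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FsPlayerParser {
    // Hypothetical lookup tables; the full maps would come from the
    // corresponding Wikipedia templates rather than being hard-coded.
    static final Map<String, String> NAT = Map.of(
            "GER", "Germany", "ENG", "England", "NED", "Netherlands",
            "CRO", "Croatia", "BRA", "Brazil", "EGY", "Egypt");
    static final Map<String, String> POS = Map.of(
            "GK", "Goalkeeper", "DF", "Defender",
            "MF", "Midfielder", "FW", "Forward");

    // Matches lines like {{Fs player|no= 1|nat=GER|name=[[Loris Karius]]|pos=GK}}
    static final Pattern FS_PLAYER = Pattern.compile(
            "\\{\\{[Ff]s player\\|no=\\s*([^|]+)\\|nat=([^|]+)\\|name=\\[\\[([^\\]]+)\\]\\]\\|pos=([A-Z]+)");

    static void parse(String line) {
        Matcher m = FS_PLAYER.matcher(line);
        if (!m.find()) return;
        // For piped links [[Target|Label]], keep the visible label.
        String name = m.group(3).contains("|")
                ? m.group(3).substring(m.group(3).indexOf('|') + 1)
                : m.group(3);
        System.out.printf("%s | %s | %s | %s%n",
                m.group(1).trim(),
                NAT.getOrDefault(m.group(2), m.group(2)),
                name,
                POS.getOrDefault(m.group(4), m.group(4)));
    }

    public static void main(String[] args) {
        parse("{{Fs player|no= 1|nat=GER|name=[[Loris Karius]]|pos=GK}}");
        parse("{{Fs player|no=12|nat=ENG|name=[[Joe Gomez (footballer)|Joe Gomez]]|pos=DF}}");
    }
}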

My take is that the core of the interpreter code should receive HTML, but we should start our pipeline from the XML dumps. I will try to find the "best" parser for Wikipedia XML dumps that outputs HTML.

aidhog commented 6 years ago

Okay, sounds good! Yes, the first task would be to try a wiki -> HTML parser; info.bliki.wiki seems like a good option (we've used it before, and it was recently updated).
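
As a sanity check of that route, the conversion step with the bliki engine might look like the following minimal sketch (assuming the info.bliki.wiki library; note that template calls such as {{Fs player}} only expand if the model can resolve the template definitions):

import info.bliki.wiki.model.WikiModel;

public class WikiToHtml {
    public static void main(String[] args) {
        String wikitext = "{| class=\"wikitable\"\n"
                        + "! Player !! Position\n"
                        + "|-\n"
                        + "| [[Loris Karius]] || GK\n"
                        + "|}";
        // Convert wiki markup to HTML; tables written in plain wikitext
        // render directly, while templates need their definitions resolved.
        String html = WikiModel.toHtml(wikitext);
        System.out.println(html);
    }
}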

emir-munoz commented 6 years ago

This issue has been solved using the Wikipedia REST API. @aidhog got this lead as a reply to an email on the Wikimedia wikitech-l mailing list: https://lists.wikimedia.org/pipermail/wikitech-l/2018-May/089905.html
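
For the record, fetching the rendered HTML of a single article through that API looks roughly like this (a sketch against the /page/html/{title} endpoint; bulk use would need rate limiting and a descriptive User-Agent):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestHtmlFetcher {
    public static void main(String[] args) throws Exception {
        // Rendered (Parsoid) HTML for one article via the Wikipedia REST API.
        String url = "https://en.wikipedia.org/api/rest_v1/page/html/Liverpool_F.C.";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "web-of-data-tables (research prototype)")
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The body is the fully rendered HTML, tables included.
        System.out.println(response.body().length() + " characters of HTML");
    }
}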