I am not sure this is really an issue with the parser; it may be more of an improvement request, unless the parser already offers a way to do this.
Consider a complex table such as the one whose source I have uploaded at https://ufile.io/tee2c
As you will see, this table, even though it is encapsulated in a single <table> tag, contains more than one table. When I use the parser, I therefore get a single consolidated output, as expected. Is it possible for the parser to recognize the valid tables inside such a complex table, so that we can get a separate output for each valid table section?
Here's the code I have, but I am not sure how to parse the table at a more granular level as described above:
from html_table_extractor.extractor import Extractor

extractor = Extractor(str(table_doc))
extractor.parse()
list_of_lists = extractor.return_list()

new_list = []
for row in list_of_lists:
    stripped_list = [item.strip() for item in row]
    # Skip rows whose cells are all identical, and rows containing a cell
    # longer than 200 characters: that is probably a paragraph string and
    # hence not a valid table item.
    if len(set(stripped_list)) == 1 or any(len(x) > 200 for x in stripped_list):
        continue
    new_list.append(stripped_list)
A bonus would be support for JSON as an output format.
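Until the parser offers such an option, the list of lists from return_list() can be serialized with Python's standard json module. The following is only a minimal sketch: rows_to_json and sample_rows are hypothetical names that are not part of the parser, and the sketch assumes the first row holds unique column names.

```python
import json

def rows_to_json(rows, header=True):
    """Serialize a list of row lists to a JSON string.

    With header=True the first row supplies the keys and every later row
    becomes one JSON object; otherwise the rows are emitted as nested arrays.
    """
    if header and rows:
        keys, *body = rows
        # Assumption: header cells are unique; duplicate names would collide.
        records = [dict(zip(keys, row)) for row in body]
        return json.dumps(records, indent=2)
    return json.dumps(rows, indent=2)

# Stand-in for the output of extractor.return_list()
sample_rows = [["name", "qty"], ["apple", "3"], ["pear", "5"]]
print(rows_to_json(sample_rows))
```

Passing header=False keeps the plain row-by-row layout, which is safer when a table has no header row or repeats column names.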
@dr333 Sorry, I cannot open the link you provided; it requires sign-up. Let me try to guess what your table looks like. So tables within tables? Such as
If this is the case, maybe I should write a function that returns the soup handle for each cell; you could then check whether each cell itself contains a table and recursively apply the table extractor.
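To illustrate the tables-within-tables idea, here is a rough sketch that separates nested tables so each one yields its own list of rows. It is not the extractor's implementation: it uses the standard library's html.parser rather than soup handles, NestedTableParser is a hypothetical name, the sample HTML is invented, and it assumes well-formed markup with explicitly closed cell tags and no rowspan/colspan handling.

```python
from html.parser import HTMLParser

class NestedTableParser(HTMLParser):
    """Collect every <table>, including nested ones, as its own list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, innermost first
        self._stack = []   # tables currently open, outermost first

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append({"rows": [], "cell": None})
        elif self._stack:
            table = self._stack[-1]
            if tag == "tr":
                table["rows"].append([])
            elif tag in ("td", "th"):
                if not table["rows"]:      # tolerate a cell without <tr>
                    table["rows"].append([])
                table["cell"] = []         # start buffering this cell's text

    def handle_data(self, data):
        # Text goes only to the current cell of the innermost open table.
        if self._stack and self._stack[-1]["cell"] is not None:
            self._stack[-1]["cell"].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        table = self._stack[-1]
        if tag in ("td", "th") and table["cell"] is not None:
            table["rows"][-1].append("".join(table["cell"]).strip())
            table["cell"] = None
        elif tag == "table":
            self.tables.append(self._stack.pop()["rows"])

# Invented example: one outer table with a table nested in a cell.
sample_html = """
<table>
  <tr><td>outer A</td><td>
    <table><tr><td>inner 1</td><td>inner 2</td></tr></table>
  </td></tr>
  <tr><td>outer B</td><td>outer C</td></tr>
</table>
"""
parser = NestedTableParser()
parser.feed(sample_html)
for rows in parser.tables:
    print(rows)
```

In this sketch a cell that only wraps a nested table comes out as an empty string in the outer table, and the nested table is reported separately.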
It would help if you could provide a concrete example with the expected output. Thanks~
Thanks @yuanxu-li, I am working on an alternative way to split the table sections, since this is probably specific to my data. In any case, supporting a JSON output format (like the CSV and TSV you already have) for the parsed output would be valuable.
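One data-specific way to split the sections might look like the sketch below. It rests on an assumption that may not hold for other data: sub-tables are separated by rows that are empty or whose cells are all identical (the same kind of row the filter in the original snippet discards). The names is_separator, split_sections, and sample_rows are hypothetical.

```python
from itertools import groupby

def is_separator(row):
    """Treat a row as a separator if it is empty or all its cells are identical."""
    return len(set(cell.strip() for cell in row)) <= 1

def split_sections(rows):
    """Group consecutive non-separator rows into separate sections."""
    return [
        [[cell.strip() for cell in row] for row in group]
        for sep, group in groupby(rows, key=is_separator)
        if not sep
    ]

# Stand-in for the output of extractor.return_list()
sample_rows = [
    ["Section 1", "Section 1"],
    ["name", "qty"],
    ["apple", "3"],
    ["", ""],
    ["city", "zip"],
    ["Oslo", "0150"],
]
for section in split_sections(sample_rows):
    print(section)
```

Note that this heuristic also treats single-column rows, and genuine data rows whose values happen to match, as separators, so it would need adjusting for such tables.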
@dr333 Again, an example would help me understand your problem better, since I do not have your data.
An example could be like:
Please find an example HTML file attached.
test.html.txt