yuanxu-li / html-table-extractor

extract data from html table
MIT License
84 stars, 22 forks

More granular parsing for some complex tables #4

Open dr333 opened 7 years ago

dr333 commented 7 years ago

I am not sure whether this is really an issue with the parser; it may be more of an improvement request, unless a solution is already available with this parser.

Consider a complex table such as the one for which I am providing source code in a file at https://ufile.io/tee2c

As you will see, this table, even though encapsulated in a single <table> tag, contains more than one table. When I use the parser, I get a single consolidated output, as expected. Is it possible for the parser to recognize the valid tables inside such a complex table, so that we can get a separate output for each valid table section?

Here's the code I have, but I am not sure how to parse the table at a more granular level as described above:

from html_table_extractor.extractor import Extractor

# table_doc is assumed to hold the <table> markup obtained earlier
extractor = Extractor(str(table_doc))
extractor.parse()
list_of_lists = extractor.return_list()
new_list = []

for row in list_of_lists:
    stripped_list = [item.strip() for item in row]
    # Skip rows whose cells are all identical, and rows with any cell longer than
    # 200 characters (probably a paragraph string, not a valid table item).
    if len(set(stripped_list)) == 1 or any(len(x) > 200 for x in stripped_list):
        continue
    new_list.append(stripped_list)

dr333 commented 7 years ago

Bonus would be to facilitate parsing to json format.

yuanxu-li commented 7 years ago

@dr333 Sorry I cannot open the link you provided. It requires signup. Let me try to guess what your table looks like. So tables within tables? Such as

<table>
  <tr>
    <td>0</td>
    <td>
      <table>
        <tr>
          <td>1</td>
          <td>2</td>
        </tr>
      </table>
    </td>
  </tr>
</table>

If this is the case, maybe I could write a function that returns the soup handle for each cell, so that you can check whether a cell itself contains a table and recursively apply the table extractor.
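To make the nested-table idea concrete, here is a minimal sketch that uses only Python's standard-library html.parser, not the library's own API (the soup-handle function is only proposed in this thread). The names NestedTableParser and parse_tables and the output shape are hypothetical. Each table becomes a list of rows, and a cell that contains a table holds that table's rows instead of text.

```python
from html.parser import HTMLParser


class NestedTableParser(HTMLParser):
    """Hypothetical helper: each table becomes a list of rows, and a cell
    that contains a table holds that table's rows instead of text."""

    def __init__(self):
        super().__init__()
        self.root_tables = []  # tables that are not inside any cell
        self._stack = []       # tables currently open, innermost last
        self._cell = None      # state of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Park the enclosing cell so it can be resumed at </table>.
            self._stack.append({"rows": [], "outer_cell": self._cell})
            self._cell = None
        elif tag == "tr" and self._stack:
            self._stack[-1]["rows"].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]["rows"]:
            self._cell = {"text": [], "table": None}

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            nested = self._cell["table"]
            text = "".join(self._cell["text"]).strip()
            self._stack[-1]["rows"][-1].append(nested if nested is not None else text)
            self._cell = None
        elif tag == "table" and self._stack:
            table = self._stack.pop()
            self._cell = table["outer_cell"]
            if self._cell is not None:
                # Simplification: a cell keeps only its last nested table.
                self._cell["table"] = table["rows"]
            else:
                self.root_tables.append(table["rows"])


def parse_tables(html):
    """Return every top-level table found in the given HTML string."""
    parser = NestedTableParser()
    parser.feed(html)
    parser.close()
    return parser.root_tables
```

Calling parse_tables on the example table above should give one outer table with a single row, whose second cell is the inner table with the row 1, 2. Text that sits next to a nested table in the same cell is dropped in this sketch; a real implementation would need to decide how to keep both.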

It would be better if you can provide a concrete example with an expected return. Thanks~

dr333 commented 7 years ago

Thanks @yuanxu-li, I am working on an alternative way to split the table sections, since the problem is probably specific to my data. In any case, supporting a JSON format for the parsed output (like the CSV and TSV output you already have) would be valuable.
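Until something like that exists, a rough sketch of the conversion can be applied to the list of lists that return_list() produces. The helper name rows_to_json is hypothetical and not part of the library, and treating the first row as the header is an assumption that will not hold for every table.

```python
import json


def rows_to_json(rows, header=True):
    """Serialize a list of rows (lists of cell strings) to a JSON string.

    With header=True the first row is assumed to hold the column names and
    every later row becomes one JSON object; otherwise the rows are dumped
    as a plain array of arrays.
    """
    if header and rows:
        keys, body = rows[0], rows[1:]
        # zip() stops at the shorter sequence, so ragged rows lose extra cells.
        records = [dict(zip(keys, row)) for row in body]
        return json.dumps(records, indent=2)
    return json.dumps(rows, indent=2)
```

For instance, rows_to_json(new_list) would turn the filtered rows from the earlier snippet into an array of objects keyed by the first row, while rows_to_json(new_list, header=False) keeps the raw row layout.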

yuanxu-li commented 7 years ago

@dr333 again an example would help me understand your problem better, since I do not have your data.

An example could be like:

  1. an HTML table (such as the one I provided)
  2. the current output by the library
  3. your expected output
dr333 commented 7 years ago

Please find an example html attached.

test.html.txt