weallwegot / omscs_chatbot

Issues Repository for Georgia Tech OMSCS Chat bot
MIT License

add to the knowledge base by using gt omscs web pages #11

Closed weallwegot closed 7 years ago

weallwegot commented 7 years ago

From @weAllWeGot on April 1, 2017 14:50

how?

Beautiful Soup 4 and HTML parsing of some of the more important pages: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

this can be used for the following info:

_Copied from original issue: weAllWeGot/kbai_chatbot3#58_

weallwegot commented 7 years ago

you can also use lxml

weallwegot commented 7 years ago

this commit handles the GT OMSCS web pages for the classes:

47fb903c365b2e27c9150a239b2dcde9a8ea2476

weallwegot commented 7 years ago

leaving open so the implementation of the non-class-specific shit can go in here too

weallwegot commented 7 years ago

also the class-related answers come out weird when the site lists things in non-paragraph tags. for example, a list that follows a colon doesn't parse well

weallwegot commented 7 years ago

fixed the bad page parsing by adding a while loop that stops when the next h4 element is reached. might need to add a stop limit of about 10 iterations in case things get weird or answers get too long, or in case they stop using h4 elements lol. 0dac9742c7dc66e29132efacc90f55805202598f
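
a rough sketch of the idea in the comment above, not the code from the commit: walk the elements after a heading, stop at the next h4, and cap the number of steps. To keep it runnable without extra installs it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml / Beautiful Soup, so it only works on well-formed markup. The sample markup, `section_text`, and `MAX_STEPS` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a course page; the real pages
# and their markup are not shown in this thread.
SAMPLE = """
<div>
  <h4>Overview</h4>
  <p>This course covers the following topics:</p>
  <ul><li>Planning</li><li>Learning</li></ul>
  <h4>Prerequisites</h4>
  <p>None.</p>
</div>
"""

MAX_STEPS = 10  # assumed safety cap, taken from the "like 10 iterations" idea


def section_text(root, heading, stop_tag="h4", max_steps=MAX_STEPS):
    """Collect the text that follows `heading` until the next `stop_tag`."""
    children = list(root)
    # Find the position of the heading whose text matches.
    start = next(
        (i for i, el in enumerate(children)
         if el.tag == stop_tag and (el.text or "").strip() == heading),
        None,
    )
    if start is None:
        return ""

    parts = []
    i = start + 1
    steps = 0
    # Stop at the next heading, at the end of the page, or at the cap.
    while i < len(children) and steps < max_steps:
        el = children[i]
        if el.tag == stop_tag:
            break
        # itertext() also picks up text nested in non-paragraph tags like <li>.
        text = " ".join(t.strip() for t in el.itertext() if t.strip())
        if text:
            parts.append(text)
        i += 1
        steps += 1
    return " ".join(parts)


root = ET.fromstring(SAMPLE)
print(section_text(root, "Overview"))
```

the cap means that a page that drops its h4 headings gives a truncated answer instead of swallowing the rest of the page. With lxml the same walk could probably use each element's next sibling instead of indexing into the parent's children.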

weallwegot commented 7 years ago

linking the lxml `_Element` API docs here because this documentation is so hard to find: http://lxml.de/api/lxml.etree._Element-class.html