yeemey / haackwell

Create a list of keywords for STATUS messages to better organize well data #3

Open chase-dwelle opened 7 years ago

chase-dwelle commented 7 years ago

In addition to the binary functional flag for each well (YES/NO), we also have free-text status messages, e.g.,

Status:Not functional|Quantity:Dry|Quality:Soft Low yield|Normally operational Dry pan|No operation in the dry season No- broken down. Well polluted No- broken down. WATER TABLE HAS DROPED

So we need to identify the recurring keywords in these messages in order to define better categories of well failure conditions.

chase-dwelle commented 7 years ago

Based on Jimmy's work with NLTK on the status messages, we have a list of keywords that correspond to different failure modes: https://www.lucidchart.com/documents/edit/26a13991-a3a9-4fb2-8572-16b497b7e191?shared=true&

Environmental drivers: {'Reduced water table', 'lowered water table','drought', 'dry', 'dried', 'low yield', 'low flow', 'poor retention','water shortage','source', 'lack','dry season','jerican','jerry can', 'shallow','climatic','insufficient', 'quantity:insufficient'}

Pollution: {'Salty', 'poorly sited', 'millky', 'coloured', 'contaminated', 'odour', 'smell', 'muddy', 'black', 'poor', 'dirty', 'silt', 'soil'}

Potential human causes: {'Committee', 'WSC', 'fuel', 'theft', 'vandalised', 'stolen', 'beneficiaries', 'pay',' paid', 'funds', 'bill', 'people', 'personnel'}

Mechanical causes: {'Pump', 'handle', 'pipes', 'tank', 'construction', 'cylinder', 'apron', 'repair', 'parts', 'installation', 'broken', 'blocked', 'technical'}
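
For anyone who wants to try these lists out, here is a rough sketch of how they could be applied to a status message. It is only an illustration, not the code Jimmy used: the names `FAILURE_KEYWORDS` and `tag_status` are made up for this example, the keywords are lowercased, and matching is a plain case-insensitive substring test.

```python
# Keyword sets copied from the lists above, lowercased for matching.
# FAILURE_KEYWORDS and tag_status are hypothetical names for this sketch.
FAILURE_KEYWORDS = {
    "environmental": {
        "reduced water table", "lowered water table", "drought", "dry",
        "dried", "low yield", "low flow", "poor retention", "water shortage",
        "source", "lack", "dry season", "jerican", "jerry can", "shallow",
        "climatic", "insufficient", "quantity:insufficient",
    },
    "pollution": {
        "salty", "poorly sited", "millky", "coloured", "contaminated",
        "odour", "smell", "muddy", "black", "poor", "dirty", "silt", "soil",
    },
    "human": {
        "committee", "wsc", "fuel", "theft", "vandalised", "stolen",
        "beneficiaries", "pay", " paid", "funds", "bill", "people",
        "personnel",
    },
    "mechanical": {
        "pump", "handle", "pipes", "tank", "construction", "cylinder",
        "apron", "repair", "parts", "installation", "broken", "blocked",
        "technical",
    },
}


def tag_status(message):
    """Return the set of categories with at least one keyword in message."""
    text = message.lower()
    return {
        category
        for category, keywords in FAILURE_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


print(tag_status("No- broken down. WATER TABLE HAS DROPED"))
print(tag_status("Dry pan|No operation in the dry season"))
```

Substring matching is crude. A message containing 'poor retention' is tagged as both environmental and pollution because 'poor' is also a pollution keyword, and 'WATER TABLE HAS DROPED' is not picked up as environmental because none of the water table phrases match that wording.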

yeemey commented 7 years ago

Some words used to tag mechanical failures (e.g. 'construction') also match wells that are in fact working (e.g. 'STATUS' = 'Functional ( in use)|New Under construction').

Consider using bigrams? Or removing 'FUNC' = 'Yes' entries from consideration for mechanical failures?
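
To make the bigram idea concrete, here is a minimal sketch in plain Python. The set `BENIGN_BIGRAMS`, its single entry, and the helper names are assumptions for illustration only; which word pairs actually indicate a working well would have to come from the data. `nltk.bigrams` could replace the `zip` call if NLTK is already in use.

```python
import re

# Hypothetical: word pairs that suggest a working well, not a failure.
BENIGN_BIGRAMS = {("under", "construction")}


def tokenize(message):
    """Lowercase the message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", message.lower())


def bigrams(tokens):
    """Return adjacent token pairs, e.g. ('under', 'construction')."""
    return list(zip(tokens, tokens[1:]))


def is_construction_failure(message):
    """Flag 'construction' unless it occurs inside a benign bigram."""
    tokens = tokenize(message)
    if "construction" not in tokens:
        return False
    return not (set(bigrams(tokens)) & BENIGN_BIGRAMS)


print(bigrams(tokenize("Functional ( in use)|New Under construction")))
print(is_construction_failure("Functional ( in use)|New Under construction"))
```

With this rule the 'New Under construction' status is no longer counted as a mechanical failure, while a message that mentions 'construction' in some other context still would be.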

chase-dwelle commented 7 years ago

I think it is fine to process them for now (if we have a MECH_FAIL column, have entries even if FUNC is yes). We can exclude the FUNC = Yes entries when we do failure analysis, then maybe next year's group can worry about cleaning up our data a little bit :)
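
A small sketch of what that could look like, using plain dictionaries in place of the real table. The column names `FUNC`, `STATUS` and `MECH_FAIL` are the ones discussed here; the `FUNC` values paired with each status message, the value 'No', and the helper `has_mech_keyword` are assumptions for the example.

```python
# Mechanical keywords from the list earlier in this thread, lowercased.
MECHANICAL_KEYWORDS = {
    "pump", "handle", "pipes", "tank", "construction", "cylinder", "apron",
    "repair", "parts", "installation", "broken", "blocked", "technical",
}

# Status messages quoted in this thread; the FUNC pairing is assumed.
records = [
    {"FUNC": "Yes", "STATUS": "Functional ( in use)|New Under construction"},
    {"FUNC": "No", "STATUS": "No- broken down. Well polluted"},
    {"FUNC": "No", "STATUS": "Dry pan|No operation in the dry season"},
]


def has_mech_keyword(status):
    """True if any mechanical keyword occurs in the status message."""
    text = status.lower()
    return any(keyword in text for keyword in MECHANICAL_KEYWORDS)


# Step 1: fill MECH_FAIL for every record, even when FUNC is 'Yes'.
for record in records:
    record["MECH_FAIL"] = has_mech_keyword(record["STATUS"])

# Step 2: drop FUNC == 'Yes' rows only when doing the failure analysis.
failures = [record for record in records if record["FUNC"] != "Yes"]
mech_failures = sum(record["MECH_FAIL"] for record in failures)
print(mech_failures, "of", len(failures), "failed wells tagged mechanical")
```

Keeping the tagging and the filtering as separate steps means the `MECH_FAIL` column stays complete, and the exclusion rule can be changed later without re-tagging.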