wibeasley / class-regex-2015

material for our regex class
MIT License
1 stars 0 forks source link

Strategies for dissecting variable names #7

Closed andkov closed 5 years ago

andkov commented 8 years ago

@wibeasley , @Maleeha , @rpoudel , @casslbrown I would like to propose a little study in the use of regexes at the specific point in data grooming process, which occurs in different forms in almost any data preparation for analysis or graphing. It is based on the real life scenario I am work on right now.

Exposition

The subject of parsing are variable names stored in a column ds$variable. It is kept in R environment and a dataframe to keep it applied and ready for transferring solutions in to the live code. This also will contextualize the example in dplyr::. This data format is typical to occur during melting/stacking/elongating data set in some respect, in this case - with respect to time.

ds <- data.frame(variable = c(
  "intage2000", "intage2005", "intage2009", # age
  "cook.70", "cook.75", "cook.79", # cook
  "bath.70", "bath.75", "bath.79", # bath
  "dressing.70", "dressing.75", "dressing.79", # dressing
  "toilet.70", "toilet.75", "toilet.79", # toilet
  "feed.70", "feed.75", "feed.79", # feed
  "chair.70", "chair.75", "chair.79", # chair
  "dem2000", "dem2005", "dem2009", # dem
  "AE69", "AG58", "AH108", #BMI
  "srh.70", "srh.75", "srh.79", #Self-rated health
  "AE320", "AG384", "AH414"  #Smoking
  )
)
# ds %>% dplyr::arrange(variable)
regex1 <- "^intage|cook|bath|dressing|toilet|feed|chair|srh|dem"
regex2 <- "^AE|AG|AH" # letter indicators of age: AE=70, AG=75, AH=79
regex3 <- "2000|2005|2009$" # numeric indicator of year
regex4 <- ".70|.75|.79$" # numeric indicator of age
regex5 <- "69|58|108|320|384|414$" # item numbers, non-descriptive

regex <- "^(\\w+?)_?(\\d{4})$" # last four digits as year

d <- ds %>%
  dplyr::mutate(
    measure = gsub(regex, "\\1", variable),
    year = gsub(regex, "\\2", variable)
  )
d %>% dplyr::arrange(variable)

Here we dealing with three waves of measures taken in 2000, 2005, and 2009. But sometimes, the waves were recorded with respect to age: cook.70, cook.75,cook.79 (which is an item on the "activity of daily living" questionnaire among seniors). All subjects were born in 1932, so age is perfectly confounded with calendar year. However, sometimes this age/wave is expressed as a text prefex of the variable name: AE, AG, AH for respective ages of 70, 75, and 79.

Starting from the forms in #6 , dplyr::mutate call parses two applicable values. The output appear below. The 'regex1:regex5` are helping to see the groups.

      variable     measure        year
1        AE320       AE320       AE320
2         AE69        AE69        AE69
3        AG384       AG384       AG384
4         AG58        AG58        AG58
5        AH108       AH108       AH108
6        AH414       AH414       AH414
7      bath.70     bath.70     bath.70
8      bath.75     bath.75     bath.75
9      bath.79     bath.79     bath.79
10    chair.70    chair.70    chair.70
11    chair.75    chair.75    chair.75
12    chair.79    chair.79    chair.79
13     cook.70     cook.70     cook.70
14     cook.75     cook.75     cook.75
15     cook.79     cook.79     cook.79
16     dem2000         dem        2000
17     dem2005         dem        2005
18     dem2009         dem        2009
19 dressing.70 dressing.70 dressing.70
20 dressing.75 dressing.75 dressing.75
21 dressing.79 dressing.79 dressing.79
22     feed.70     feed.70     feed.70
23     feed.75     feed.75     feed.75
24     feed.79     feed.79     feed.79
25  intage2000      intage        2000
26  intage2005      intage        2005
27  intage2009      intage        2009
28      srh.70      srh.70      srh.70
29      srh.75      srh.75      srh.75
30      srh.79      srh.79      srh.79
31   toilet.70   toilet.70   toilet.70
32   toilet.75   toilet.75   toilet.75
33   toilet.79   toilet.79   toilet.79

Quest

The goal of the exercise is to design regexes to parse relevant item information before widening the dataset again, leaving year and age in the long columns of the new, wide data set. However, I would like us to be mindful of two potential objectives:

objective A

most efficient/elegant/robust regex solution for the given set of variables

objective B

considers less elegant regex solutions made work but slightly more verbose code. The precision of regexes that @wibeasley prefers looks very unfriendly to the folks like me. And while I'm gradually leaving this sad demographics of regex illiterates, I also need to consider my colleagues who will not have time to study the precise regexes enough to modify them, but would be willing to consider a hybrid, because R (and especially dplyr:: syntax) is something they can still tolerate.

So instead of the elegant and precise solution, I would like to have an option of using human-readable regexes like the example I gave in regex1:regex5. In other words, the second objective could be approximately formulated as

Can we accomplish the task using the structure of patterns in regex1:regex5 (or similar) and a clever use of dplyr:: syntax? What portion and part of regex complexity can we outsoure to dplyr::?

I put the starter script in ./studies/dissecting-names/ and will tag this issue in its development.

Maleeha commented 8 years ago

@andkov, thank you for the exposition. We can also have a few regex cheat sheets laminated for our purpose. I downloaded a few off the web yesterday. Seem very helpful!

andkov commented 8 years ago

Here's the subject for the easy copy/paste into http://regex101.com

AE320
AE69
AG384
AG58
AH108
AH414
bath.70
bath.75
bath.79
chair.70
chair.75
chair.79
cook.70
cook.75
cook.79
dem2000
dem2005
dem2009
dressing.70
dressing.75
dressing.79
feed.70
feed.75
feed.79
intage2000
intage2005
intage2009
srh.70
srh.75
srh.79
toilet.70
toilet.75
toilet.79
andkov commented 8 years ago

For the above subject, what would be the regex that says take the last two digits of the value, if they are preceded by a period, but if not, then return nothing. Here's what i've got so far

^(\w+?).?(\d{2})$
wibeasley commented 8 years ago

don't forget to make the dot literal: ^(\w+?)\.?(\d{2})$

andkov commented 8 years ago

That's what I've been looking for. Thanks for the nudge. Here's the progress. image

regex <- "^(\\w+?)\\.(\\d{2})$" 
ds %>% dplyr::mutate(
    age = variable, # copy existing
    age = gsub(regex, "\\2", variable)
  ) %>% dplyr::arrange(variable)

      variable        age
1        AE320      AE320
2         AE69       AE69
3        AG384      AG384
4         AG58       AG58
5        AH108      AH108
6        AH414      AH414
7      bath.70         70
8      bath.75         75
9      bath.79         79
10    chair.70         70
11    chair.75         75
12    chair.79         79
13     cook.70         70
14     cook.75         75
15     cook.79         79
16     dem2000    dem2000
17     dem2005    dem2005
18     dem2009    dem2009
19 dressing.70         70
20 dressing.75         75
21 dressing.79         79
22     feed.70         70
23     feed.75         75
24     feed.79         79
25  intage2000 intage2000
26  intage2005 intage2005
27  intage2009 intage2009
28      srh.70         70
29      srh.75         75
30      srh.79         79
31   toilet.70         70
32   toilet.75         75
33   toilet.79         79
wibeasley commented 8 years ago

I'm concerned that there won't be an easy & general way to clearly encode all the info you want solely in the variable names. Regardless of regexes, I think it might be too complicated to cover so many scenarios.

In contrast, consider a metadata table, where each row is a variable in the real dataset. And each column in the metadata table are things like (a) domain, (b) age/year/duration_since_baseline.

This has the possibility of being more flexible to program against, and clearer to humans (esp the non-programmers).

andkov commented 8 years ago

excellent points.

I agree, a metadata is a more general and robust solution. I am already incorporating the type of metadata we developed for the domain mapping and now in the harmonization work, so i'm definitely will be looking into outsourcing some of the of complexity into the metadata. I can't yet comment on the distribution of labor among metadata and parsing script. Guess it might be something you feel up empirically.

andkov commented 8 years ago

Here's the full solution to the transformation. What can be optimized?

ds %>% dplyr::arrange(variable)

regex <- "^(\\w+?)\\.(\\d{2})$" 
regex2 <- "^(AE|AG|AH)(\\d+?)$"
regex3 <- "^(\\w+?)(\\d{4})$"
d <- ds %>% dplyr::mutate(
  age = variable, # copy existing
  age = gsub(regex, "\\2", variable),
  age = gsub(regex2, "\\1", age),
  age = gsub(regex3, "\\2", age),
  age = car::recode(age, " c('2000','AE')='70';
                    c('2005','AG')='75';
                    c('2009','AH')='79';"),
  measure = gsub(regex, "\\1", variable), 
  measure = gsub(regex2, "\\2", measure),
  measure = gsub(regex3, "\\1", measure),
  measure = car::recode(measure, " c('320','384','414')='smoking';
                                   c('69','58','108')='bmi' "), 
                          year = age + 1930
  ) %>% dplyr::arrange(desc(variable))
  (d) 

      variable age  measure year
1   intage2009  79   intage 2009
2   intage2005  75   intage 2005
3   intage2000  70   intage 2000
4      feed.79  79     feed 2009
5      feed.75  75     feed 2005
6      feed.70  70     feed 2000
7  dressing.79  79 dressing 2009
8  dressing.75  75 dressing 2005
9  dressing.70  70 dressing 2000
10     dem2009  79      dem 2009
11     dem2005  75      dem 2005
12     dem2000  70      dem 2000
13       AH414  79  smoking 2009
14       AH108  79      bmi 2009
15        AG58  75      bmi 2005
16       AG384  75  smoking 2005
17        AE69  70      bmi 2000
18       AE320  70  smoking 2000
andkov commented 8 years ago

The following data saves all the complexity we want in a subject and is a bit more manageable

ds <- data.frame(variable = c(
  "intage2000", "intage2005", "intage2009", # age
  "dressing.70", "dressing.75", "dressing.79", # dressing
  "feed.70", "feed.75", "feed.79", # feed
  "dem2000", "dem2005", "dem2009", # dem
  "AE69", "AG58", "AH108", #BMI
  "AE320", "AG384", "AH414"  #Smoking
)
)
wibeasley commented 8 years ago

I had the impression that a variable's time component would have only one of (a) age, (b) year, or (c) duration since baseline. Even if some variables in some studies have more than one of those qualities, I think it has to be built to work with zero of those components specified, since there will be static variables like DOB.

      variable age  measure year
1   intage2009  79   intage 2009
2   intage2005  75   intage 2005
3   intage2000  70   intage 2000
andkov commented 8 years ago

Normally, yes. And per variable. But because we've melted a collection of variables (the time component of which varied by variable: year, age (e.g. .79), wave(e.g. AE) we have a mix of these in the d_long$variable during the wide-long-wide transformations.