@andkov, thank you for the exposition. We could also have a few regex cheat sheets laminated for our purposes. I downloaded a few off the web yesterday; they seem very helpful!
Here's the subject text for easy copy/paste into http://regex101.com:
AE320
AE69
AG384
AG58
AH108
AH414
bath.70
bath.75
bath.79
chair.70
chair.75
chair.79
cook.70
cook.75
cook.79
dem2000
dem2005
dem2009
dressing.70
dressing.75
dressing.79
feed.70
feed.75
feed.79
intage2000
intage2005
intage2009
srh.70
srh.75
srh.79
toilet.70
toilet.75
toilet.79
For the above subject, what would be the regex that takes the last two digits of the value if they are preceded by a period, but otherwise returns nothing? Here's what I've got so far:
^(\w+?).?(\d{2})$
don't forget to make the dot literal:
^(\w+?)\.?(\d{2})$
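To sanity-check the escaped dot, here is a minimal sketch in base R (the vector x is just an illustration; note that the code further down drops the ? so the dot is required rather than optional, which is what keeps names like AE320 and dem2000 from matching):
x <- c("bath.70", "AE320", "dem2000")
pattern <- "^(\\w+?)\\.(\\d{2})$"   # literal, required dot
grepl(pattern, x)
#> [1]  TRUE FALSE FALSE
sub(pattern, "\\2", x)   # non-matching strings pass through unchanged
#> [1] "70"      "AE320"   "dem2000"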
That's what I've been looking for. Thanks for the nudge. Here's the progress.
regex <- "^(\\w+?)\\.(\\d{2})$"
ds %>%
  dplyr::mutate(
    age = variable, # copy existing
    age = gsub(regex, "\\2", variable)
  ) %>%
  dplyr::arrange(variable)
variable age
1 AE320 AE320
2 AE69 AE69
3 AG384 AG384
4 AG58 AG58
5 AH108 AH108
6 AH414 AH414
7 bath.70 70
8 bath.75 75
9 bath.79 79
10 chair.70 70
11 chair.75 75
12 chair.79 79
13 cook.70 70
14 cook.75 75
15 cook.79 79
16 dem2000 dem2000
17 dem2005 dem2005
18 dem2009 dem2009
19 dressing.70 70
20 dressing.75 75
21 dressing.79 79
22 feed.70 70
23 feed.75 75
24 feed.79 79
25 intage2000 intage2000
26 intage2005 intage2005
27 intage2009 intage2009
28 srh.70 70
29 srh.75 75
30 srh.79 79
31 toilet.70 70
32 toilet.75 75
33 toilet.79 79
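A side note on the non-matching rows: gsub() returns the input unchanged when the pattern does not match, which is why AE320 stays AE320 above. If "return nothing" should literally mean a missing value, one option (a sketch, not part of the code above) is to gate the substitution on a match:
regex <- "^(\\w+?)\\.(\\d{2})$"
ds %>%
  dplyr::mutate(
    age = dplyr::if_else(
      grepl(regex, variable),
      gsub(regex, "\\2", variable),
      NA_character_
    )
  )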
I'm concerned that there won't be an easy & general way to clearly encode all the info you want solely in the variable names. Regardless of regexes, I think it might be too complicated to cover so many scenarios.
In contrast, consider a metadata table, where each row is a variable in the real dataset, and each column is something like (a) domain, (b) age/year/duration_since_baseline.
This has the possibility of being more flexible to program against, and clearer to humans (esp the non-programmers).
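For concreteness, a minimal sketch of what such a metadata table might look like (the column names and values here are hypothetical, just to illustrate the join-instead-of-parse idea):
ds_meta <- data.frame(
  variable = c("bath.70", "AE320",   "dem2000"),
  domain   = c("bath",    "smoking", "dem"),
  age      = c(70L,       70L,       70L),
  year     = c(2000L,     2000L,     2000L),
  stringsAsFactors = FALSE
)
# the grooming script could then join on `variable` instead of parsing names:
# ds %>% dplyr::left_join(ds_meta, by = "variable")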
Excellent points.
I agree, a metadata table is a more general and robust solution. I am already incorporating the type of metadata we developed for the domain mapping and now use in the harmonization work, so I will definitely be looking into outsourcing some of the complexity into the metadata. I can't yet comment on the distribution of labor between the metadata and the parsing script; I guess it might be something you feel out empirically.
Here's the full solution to the transformation. What can be optimized?
ds %>% dplyr::arrange(variable)
regex  <- "^(\\w+?)\\.(\\d{2})$"
regex2 <- "^(AE|AG|AH)(\\d+?)$"
regex3 <- "^(\\w+?)(\\d{4})$"
d <- ds %>%
  dplyr::mutate(
    age = variable, # copy existing
    age = gsub(regex,  "\\2", variable),
    age = gsub(regex2, "\\1", age),
    age = gsub(regex3, "\\2", age),
    age = car::recode(age, " c('2000','AE')='70';
                             c('2005','AG')='75';
                             c('2009','AH')='79';"),
    measure = gsub(regex,  "\\1", variable),
    measure = gsub(regex2, "\\2", measure),
    measure = gsub(regex3, "\\1", measure),
    measure = car::recode(measure, " c('320','384','414')='smoking';
                                     c('69','58','108')='bmi' "),
    year = age + 1930
  ) %>%
  dplyr::arrange(desc(variable))
d
variable age measure year
1 intage2009 79 intage 2009
2 intage2005 75 intage 2005
3 intage2000 70 intage 2000
4 feed.79 79 feed 2009
5 feed.75 75 feed 2005
6 feed.70 70 feed 2000
7 dressing.79 79 dressing 2009
8 dressing.75 75 dressing 2005
9 dressing.70 70 dressing 2000
10 dem2009 79 dem 2009
11 dem2005 75 dem 2005
12 dem2000 70 dem 2000
13 AH414 79 smoking 2009
14 AH108 79 bmi 2009
15 AG58 75 bmi 2005
16 AG384 75 smoking 2005
17 AE69 70 bmi 2000
18 AE320 70 smoking 2000
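One possible answer to "what can be optimized" (a sketch only, reusing ds and regex/regex2/regex3 from above; the lookup vectors are just a restatement of the car::recode() strings): replace the nested recode strings with small named lookup vectors, which are easier to read and to extend.
wave_lookup <- c("2000" = "70", "2005" = "75", "2009" = "79",
                 "AE"   = "70", "AG"   = "75", "AH"   = "79")
item_lookup <- c("320" = "smoking", "384" = "smoking", "414" = "smoking",
                 "69"  = "bmi",     "58"  = "bmi",     "108" = "bmi")
d <- ds %>%
  dplyr::mutate(
    age     = gsub(regex,  "\\2", variable),
    age     = gsub(regex2, "\\1", age),
    age     = gsub(regex3, "\\2", age),
    age     = dplyr::coalesce(unname(wave_lookup[age]), age),
    measure = gsub(regex,  "\\1", variable),
    measure = gsub(regex2, "\\2", measure),
    measure = gsub(regex3, "\\1", measure),
    measure = dplyr::coalesce(unname(item_lookup[measure]), measure),
    year    = as.integer(age) + 1930
  )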
The following data frame captures all the complexity we want in a single subject and is a bit more manageable:
ds <- data.frame(variable = c(
  "intage2000",  "intage2005",  "intage2009",  # age
  "dressing.70", "dressing.75", "dressing.79", # dressing
  "feed.70",     "feed.75",     "feed.79",     # feed
  "dem2000",     "dem2005",     "dem2009",     # dem
  "AE69",        "AG58",        "AH108",       # BMI
  "AE320",       "AG384",       "AH414"        # smoking
))
I had the impression that a variable's time component would have only one of (a) age, (b) year, or (c) duration since baseline. Even if some variables in some studies have more than one of those qualities, I think it has to be built to work with zero of those components specified, since there will be static variables like DOB.
variable age measure year
1 intage2009 79 intage 2009
2 intage2005 75 intage 2005
3 intage2000 70 intage 2000
Normally, yes, and per variable. But because we've melted a collection of variables whose time component varied by variable (year, age (e.g. .79), or wave (e.g. AE)), we end up with a mix of these in d_long$variable during the wide-long-wide transformations.
@wibeasley, @Maleeha, @rpoudel, @casslbrown I would like to propose a little study in the use of regexes at a specific point in the data grooming process, one that occurs in some form in almost any data preparation for analysis or graphing. It is based on a real-life scenario I am working on right now.
Exposition
The subject of parsing is a set of variable names stored in a column, ds$variable. It is kept in the R environment as a data frame so that solutions stay applied and ready to transfer into the live code; this also contextualizes the example in dplyr:: syntax. This data format typically arises when melting/stacking/elongating a data set in some respect, in this case with respect to time (a small melting sketch appears at the end of this exposition). Here we are dealing with three waves of measures taken in 2000, 2005, and 2009. But sometimes the waves were recorded with respect to age: cook.70, cook.75, cook.79 (an item on the "activity of daily living" questionnaire among seniors). All subjects were born in 1932, so age is perfectly confounded with calendar year. However, sometimes this age/wave is expressed as a text prefix of the variable name: AE, AG, AH for the respective ages of 70, 75, and 79. Starting from the forms in #6, the dplyr::mutate call parses two applicable values. The output appears below. The regex1:regex5 objects help to see the groups.
Quest
The goal of the exercise is to design regexes that parse the relevant item information before widening the dataset again, leaving year and age as columns of the new, wide data set. However, I would like us to be mindful of two potential objectives:
objective A: the most efficient/elegant/robust regex solution for the given set of variables.
objective B: less elegant regex solutions made to work with slightly more verbose code. The precision of the regexes that @wibeasley prefers looks very unfriendly to folks like me. And while I'm gradually leaving this sad demographic of regex illiterates, I also need to consider my colleagues who will not have time to study the precise regexes enough to modify them, but who would be willing to consider a hybrid, because R (and especially dplyr:: syntax) is something they can still tolerate. So instead of the elegant and precise solution, I would like to have the option of using human-readable regexes like the example I gave in regex1:regex5 (a rough sketch of that style appears below). In other words, the second objective could be approximately formulated along those lines.
I put the starter script in ./studies/dissecting-names/ and will tag this issue in its development.
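Since regex1:regex5 themselves are not shown in this thread, here is a rough sketch (names and patterns are illustrative only) of the kind of small, single-purpose, human-readable patterns that objective B has in mind:
regex_dot_age   <- "\\.(\\d{2})$"   # trailing ".70" / ".75" / ".79"
regex_year      <- "(\\d{4})$"      # trailing "2000" / "2005" / "2009"
regex_wave_code <- "^(AE|AG|AH)"    # leading wave prefix
ds %>%
  dplyr::mutate(
    age_from_dot = ifelse(grepl(regex_dot_age,   variable), sub(".*\\.(\\d{2})$", "\\1", variable), NA_character_),
    year_in_name = ifelse(grepl(regex_year,      variable), sub(".*(\\d{4})$",    "\\1", variable), NA_character_),
    wave_code    = ifelse(grepl(regex_wave_code, variable), sub("^(AE|AG|AH).*",  "\\1", variable), NA_character_)
  )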