Selecting columns for a list breaks multiple matches

bobpaul commented 7 years ago

Maybe this can already be done and I'm just not getting it, but here's a contrived example to illustrate.

Let's say I have some output of ps aux which looks like this:

$ ps aux 
message+   792  0.0  0.0  42892  3672 ?        Ss   11:33   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root       839  0.0  0.1 274488  5924 ?        Ssl  11:33   0:00 /usr/lib/accountsservice/accounts-daemon
daemon     846  0.0  0.0  26044  2064 ?        Ss   11:33   0:00 /usr/sbin/atd -f
root      1003  0.0  0.0  13376   168 ?        Ss   11:33   0:00 /sbin/mdadm --monitor --pid-file /run/mdadm/monitor.pid --daemonise --scan --syslog
bobpaul     1318  0.0  0.1  21516  5224 pts/0    Ss   11:37   0:00 -bash
bobpaul     1339  0.0  4.5 676188 183092 ?       Ssl  11:38   0:18 emacs --daemon
bobpaul     1499  0.0  0.1  21568  5504 pts/1    Ss+  11:48   0:00 -bash
bobpaul     1512  0.0  0.1  21480  5420 pts/2    Ss+  11:48   0:00 -bash
bobpaul     2635  0.0  0.0  12944   936 pts/0    R+   19:03   0:00 grep --color=auto -e daemon -e bash
bobpaul     2636  0.0  0.0  21516  2104 pts/0    D+   19:03   0:00 -bash
$

Now, for all lines that contain bash I want to print the 5th column. For all lines that contain daemon I want to print the 2nd column. This can be done in awk like:

$ ps aux | awk '/daemon/ { print $2 } /bash/ { print $5 }'
792
839
846
1003
21516
1339
21568
21480
2635
12944
21516
$

So I try to build the command incrementally with pyp. I start by matching both conditions, which, after a bit of messing around, I figured out I could do with 'or'. (Maybe this is already abusive.)

$ ps aux | pyp "p.re('.*daemon.*').split() or p.re('.*bash.*').split()"
[[0]message+[1]792[2]0.0[3]0.0[4]42892[5]3672[6]?[7]Ss[8]11:33[9]0:00[10]/usr/bin/dbus-daemon[11]--system[12]--address=systemd:[13]--nofork[14]--nopidfile[15]--systemd-activation]
[[0]root[1]839[2]0.0[3]0.1[4]274488[5]5924[6]?[7]Ssl[8]11:33[9]0:00[10]/usr/lib/accountsservice/accounts-daemon]
[[0]daemon[1]846[2]0.0[3]0.0[4]26044[5]2064[6]?[7]Ss[8]11:33[9]0:00[10]/usr/sbin/atd[11]-f]
[[0]root[1]1003[2]0.0[3]0.0[4]13376[5]168[6]?[7]Ss[8]11:33[9]0:00[10]/sbin/mdadm[11]--monitor[12]--pid-file[13]/run/mdadm/monitor.pid[14]--daemonise[15]--scan[16]--syslog]
[[0]bobpaul[1]1318[2]0.0[3]0.1[4]21516[5]5224[6]pts/0[7]Ss[8]11:37[9]0:00[10]-bash]
[[0]bobpaul[1]1339[2]0.0[3]4.5[4]676188[5]183092[6]?[7]Ssl[8]11:38[9]0:18[10]emacs[11]--daemon]
[[0]bobpaul[1]1499[2]0.0[3]0.1[4]21568[5]5504[6]pts/1[7]Ss+[8]11:48[9]0:00[10]-bash]
[[0]bobpaul[1]1512[2]0.0[3]0.1[4]21480[5]5420[6]pts/2[7]Ss+[8]11:48[9]0:00[10]-bash]
[[0]bobpaul[1]2635[2]0.0[3]0.0[4]12944[5]936[6]pts/0[7]R+[8]19:03[9]0:00[10]grep[11]--color=auto[12]-e[13]daemon[14]-e[15]bash]
[[0]bobpaul[1]2636[2]0.0[3]0.0[4]21516[5]2104[6]pts/0[7]D+[8]19:03[9]0:00[10]-bash]
$

Good so far. Now grab the columns (remember, awk is 1-indexed and Python is 0-indexed):

$ ps aux | pyp "p.re('.*daemon.*').split()[1] or p.re('.*bash.*').split()[4]"
792
839
846
1003
1339
2635
$

Wait, that's not enough results. It only shows the columns for the daemon matches. I think what's happening is that the [1] selector must cause the first part to evaluate to True in cases where the regex didn't match (returned None). (None[1] would cause an exception, so part of the exception handling routine must make it always return True.)

This becomes apparent if we remove the column selector from the daemon regex:

$ ps aux | pyp "p.re('.*daemon.*').split() or p.re('.*bash.*').split()[4]"
[[0]message+[1]792[2]0.0[3]0.0[4]42892[5]3672[6]?[7]Ss[8]11:33[9]0:00[10]/usr/bin/dbus-daemon[11]--system[12]--address=systemd:[13]--nofork[14]--nopidfile[15]--systemd-activation]
[[0]root[1]839[2]0.0[3]0.1[4]274488[5]5924[6]?[7]Ssl[8]11:33[9]0:00[10]/usr/lib/accountsservice/accounts-daemon]
[[0]daemon[1]846[2]0.0[3]0.0[4]26044[5]2064[6]?[7]Ss[8]11:33[9]0:00[10]/usr/sbin/atd[11]-f]
[[0]root[1]1003[2]0.0[3]0.0[4]13376[5]168[6]?[7]Ss[8]11:33[9]0:00[10]/sbin/mdadm[11]--monitor[12]--pid-file[13]/run/mdadm/monitor.pid[14]--daemonise[15]--scan[16]--syslog]
21516
[[0]bobpaul[1]1339[2]0.0[3]4.5[4]676188[5]183092[6]?[7]Ssl[8]11:38[9]0:18[10]emacs[11]--daemon]
21568
21480
[[0]bobpaul[1]2635[2]0.0[3]0.0[4]12944[5]936[6]pts/0[7]R+[8]19:03[9]0:00[10]grep[11]--color=auto[12]-e[13]daemon[14]-e[15]bash]
21516
$

Now it's returning both matches again, but only selecting columns on the second match.
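For comparison, here is a minimal plain-Python sketch of the same expression shape. It is not pyp's implementation: the helper `matched` is hypothetical and assumes that a failed match yields an empty string rather than None. Under that assumption, indexing the empty field list raises IndexError before `or` is ever evaluated, so the bash branch never runs. What pyp then does with that exception is not shown here.

```python
import re

def matched(pattern, line):
    # Hypothetical stand-in for p.re(); assumes "" when nothing matches.
    m = re.search(pattern, line)
    return m.group(0) if m else ""

def naive(line):
    # Same shape as: p.re('.*daemon.*').split()[1] or p.re('.*bash.*').split()[4]
    # On a bash-only line the left side is [][1], which raises IndexError.
    return matched(".*daemon.*", line).split()[1] or matched(".*bash.*", line).split()[4]

def guarded(line):
    # Index each field list only after checking that it is non-empty.
    daemon = matched(".*daemon.*", line).split()
    bash = matched(".*bash.*", line).split()
    return (daemon[1] if daemon else "") or (bash[4] if bash else "")

BASH_LINE = "bobpaul     1318  0.0  0.1  21516  5224 pts/0    Ss   11:37   0:00 -bash"
DAEMON_LINE = "root       839  0.0  0.1 274488  5924 ?        Ssl  11:33   0:00 /usr/lib/accountsservice/accounts-daemon"

if __name__ == "__main__":
    print(guarded(DAEMON_LINE))
    print(guarded(BASH_LINE))
    try:
        naive(BASH_LINE)
    except IndexError as exc:
        print("naive() failed:", exc)
```

This also matches the run without the [1] selector: an empty list is falsy, so `or` falls through to the bash side and the column is selected there.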

Am I just approaching this problem the wrong way, or is it not currently possible to replicate the awk code that outputs a different column depending on what within the line matched?

zenlc2000 commented 7 years ago

My pyp is a little rusty - I don't get to use it as much as I'd like. I know how I'd do it with vanilla python:

# Sketch in plain Python; reads the ps output from stdin
import sys

for line in sys.stdin:
    fields = line.split()  # split on any run of whitespace
    if "bash" in line:
        print(fields[4])
    elif "daemon" in line:
        print(fields[1])

I'll have to spend a bit of time playing with pyp again to see how it would work.

zenlc2000 commented 7 years ago

I get a little bit closer if I only keep lines containing the strings you want. Now I get blanks for the second re.

$ cat pyp_test.txt | ./pyp3 "'bash' in p or 'daemon' in p" | ./pyp3 "p.re('.*daemon.*').split()[1] or p.re('.*bash.*').split()[4]"
792
839
846
1003

1339

2635

I still kind of think this can be done as-is. Just need to think it through a bit more.

bobpaul commented 7 years ago

Oh, you gave me an idea and I got very close:

$ cat ps.txt | python2 pyp3 "keep('daemon') or keep('bash') | p.split()[1] if 'daemon' in p else p.split()[4] if 'bash' in p else ''"
792
839
846
1003
21516
1339
21568
21480
2635
21516
$ cat ps.txt | awk '/daemon/ { print $2 } /bash/ { print $5 }'
792
839
846
1003
21516
1339
21568
21480
2635
12944
21516

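The difference is that one line (the grep command, PID 2635) contains both bash and daemon. Awk is performing two independent ifs, so it prints two values for that line (2635 and 12944), whereas the pyp statement above is an if-else and prints only the first (2635).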
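The two behaviours can be sketched in plain Python (not pyp syntax). The function names `independent_ifs` and `if_else` are made up for illustration, and the sample lines are taken from the ps output above.

```python
def independent_ifs(lines):
    # Mirrors awk '/daemon/ { print $2 } /bash/ { print $5 }':
    # both tests run on every line, so one line can emit two values.
    out = []
    for line in lines:
        fields = line.split()
        if "daemon" in line:
            out.append(fields[1])
        if "bash" in line:
            out.append(fields[4])
    return out

def if_else(lines):
    # Mirrors the conditional expression: the first matching branch wins.
    out = []
    for line in lines:
        fields = line.split()
        if "daemon" in line:
            out.append(fields[1])
        elif "bash" in line:
            out.append(fields[4])
    return out

SAMPLE = [
    "root       839  0.0  0.1 274488  5924 ?        Ssl  11:33   0:00 /usr/lib/accountsservice/accounts-daemon",
    "bobpaul     1318  0.0  0.1  21516  5224 pts/0    Ss   11:37   0:00 -bash",
    "bobpaul     2635  0.0  0.0  12944   936 pts/0    R+   19:03   0:00 grep --color=auto -e daemon -e bash",
]

if __name__ == "__main__":
    print(independent_ifs(SAMPLE))  # the grep line contributes two values
    print(if_else(SAMPLE))          # the grep line contributes one value
```

Only the line that matches both patterns differs between the two versions, which is exactly the missing 12944.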

zenlc2000 / pyp3

Selecting columns for a list breaks multiple matches #2