westes / flex

The Fast Lexical Analyzer - scanner generator for lexing in C and C++

mkskel.sh & Apple sed #294

Open stek29 opened 6 years ago

stek29 commented 6 years ago

Since 3f2b9a4 it only works with GNU sed.

With Apple sed (BSD sed too?) it breaks lines at the letter r, producing an invalid C file.
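The symptom is consistent with how POSIX bracket expressions work: inside [...] a backslash is an ordinary character, so a sed without GNU-style escapes reads [\r] as "backslash or the letter r" rather than a carriage return. A small probe, assuming only a POSIX shell, to see which reading the local sed uses; the function name sed_bracket_mode is made up for this sketch:

```sh
# Hypothetical helper: report how the local sed reads \r inside a bracket expression.
sed_bracket_mode() {
    # Feed a lone letter r through sed. A sed that takes [\r] literally
    # (backslash or r) rewrites it to X; a sed that reads \r as a carriage
    # return leaves the r alone.
    if [ "$(printf 'r\n' | sed 's/[\r]/X/')" = "X" ]; then
        echo literal
    else
        echo escape
    fi
}

sed_bracket_mode
```

GNU sed is expected to report escape, and a sed that follows the POSIX reading reports literal.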

jannick0 commented 6 years ago

Ouch, this is getting tricky, since mkskel.sh invokes the shell (including its internal field separator), sed and m4, each with its own presumed EOL treatment (inherited from how it was compiled?), while the input file could follow the OS's EOL standard or have converted EOLs (which can happen when, e.g., cloning with git).

@westes I should admit that this is a little beyond my knowledge of sed and company, particularly across operating systems: I am sitting in front of a Windows box using the coreutils shipped by Cygwin or MSYS, where consistent EOL treatment is always a bit of a stretch.

stek29 commented 6 years ago

Probably using perl at least makes sense, or doing tr '\r' '\n' before sed

westes commented 6 years ago

No to perl.

The tr command is probably wrong in the general case but may be ok in inputs we care about.
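To illustrate why tr alone may be wrong in the general case, here is a quick sketch with made-up sample input (crlf_sample is a hypothetical helper): translating CR to LF doubles the line breaks of a CRLF file, while deleting CR only works when every CR is followed by LF.

```sh
# Sample input: two lines with CRLF endings.
crlf_sample() { printf 'one\r\ntwo\r\n'; }

# tr '\r' '\n' turns every CRLF into two newlines, so blank lines appear.
crlf_sample | tr '\r' '\n' | wc -l      # 4 lines instead of 2

# tr -d '\r' handles CRLF, but would glue a CR-only file into one line.
crlf_sample | tr -d '\r' | wc -l        # 2 lines
printf 'one\rtwo\r' | tr -d '\r'        # onetwo (no line breaks left)
```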

In reply to Viktor Oreshkin's comment of Thursday, 14 December 2017, 10:48 pm +0000.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/westes/flex/issues/294#issuecomment-351859895

-- Will Estes westes575@gmail.com

westes commented 6 years ago

Yeah it is kind of a mess, unfortunately.

In reply to jannick0's comment of Thursday, 14 December 2017, 10:43 pm +0000.

jannick0 commented 6 years ago

What about something like sed -E ':a;N;$!ba;s/(\r\n|\r)/\n/g', which slurps the whole input in a label loop, to normalize EOLs at some stage(s) of mkskel.sh?
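The one-liner above leans on GNU sed behaviour (\r in the pattern, \n in the replacement), so it would likely hit the same portability problem this issue is about. A possible alternative, sketched under the assumption that only POSIX sed and tr are available; normalize_eol is a hypothetical name, not something in mkskel.sh:

```sh
# Hypothetical helper: rewrite CRLF and lone CR line endings as LF.
normalize_eol() {
    cr=$(printf '\r')        # a literal CR, so sed never has to parse \r
    sed "s/${cr}\$//" |      # CRLF: drop the CR sitting just before each LF
        tr '\r' '\n'         # any CR still left is a bare CR line ending
}

printf 'a\r\nb\rc\n' | normalize_eol
```

One known limitation of this sketch: a CR-only file whose last line ends in CR may lose that final terminator, because sed sees the whole file as a single line ending in CR.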

westes commented 6 years ago

You'd also have to remember what the original state of the file is so that you can write it back in the way the caller expects, I think.
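Remembering the original state could start with classifying the input's line endings. A rough sketch, assuming a byte count is good enough; detect_eol is a hypothetical helper, and the crlf result is a heuristic because it only compares counts:

```sh
# Hypothetical helper: classify a file's line endings by counting CR and LF bytes.
detect_eol() {
    # $(( ... )) strips the padding some wc implementations print.
    n_lf=$(( $(tr -cd '\n' < "$1" | wc -c) ))
    n_cr=$(( $(tr -cd '\r' < "$1" | wc -c) ))
    if [ "$n_cr" -eq 0 ]; then
        echo lf
    elif [ "$n_lf" -eq 0 ]; then
        echo cr
    elif [ "$n_cr" -eq "$n_lf" ]; then
        echo crlf      # heuristic: equal counts, pairing is not verified
    else
        echo mixed
    fi
}
```

A caller could store the result, e.g. eol=$(detect_eol flex.skl), work on a normalized copy, and convert back at the end.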

In reply to jannick0's comment of Friday, 15 December 2017, 7:14 am -0800.

jannick0 commented 6 years ago

Umm - then what about using gawk to remember the EOL structure of flex.skl as is?

# mkskel.awk
# sample call: gawk -f ./mkskel.awk flex.skl > skel1.c

BEGIN {
    oRS = RS    # remember the default record separator ("\n") for the output
    RS = "\f"   # or "\v": any character that is rare in, or absent from, the input,
                # so that gawk ideally slurps the whole stream in one step
    lines = ""
    dbg = 0
    #dbg = 1
}

{
    # re-join the pieces in case the input did contain RS
    lines = (lines == "") ? $0 : lines RS $0
    c++
}

END {
    if (dbg)
        print "input stream read in " c " step(s)" > "/dev/stderr"

    if (lines == "") {
        print "no lines read from input file / stream" > "/dev/stderr"
        exit 1
    }

    # compose the definition of the string array skel,
    # joining the input lines with their original EOLs
    s = "/* File created from flex.skl via mkskel.sh */" oRS oRS
    s = s "#include \"flexdef.h\"" oRS oRS
    s = s "const char *skel[] = {" oRS

    # the fourth argument of split() (aEOL) is a gawk extension, not POSIX
    n = split(lines, aLine, "\r\n|\r|\n", aEOL)
    if (n > 0 && aLine[n] == "")
        n--     # ignore the empty field after a trailing EOL
    for (i = 1; i <= n; i++) {
        line = aLine[i]
        gsub(/\\/, "&&", line)      # double backslashes for the C literal
        gsub(/"/, "\\\"", line)     # escape double quotes
        s = s "\t\"" line "\"," (i < n ? aEOL[i] : "")
    }

    s = s oRS "\t0" oRS "};"

    print s
}

westes commented 6 years ago

We can't assume it's GNU awk.

But if some fairly generic awk will do that, then I'm open to it.

And even some Linux distributions ship some pretty abominable excuses calling themselves "awk", so it's not just a BSD/OS X thing.
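For comparison, the core of the conversion can be written without gawk extensions if the output EOL is simply normalized to the awk default instead of being preserved. A minimal sketch, not the script from this thread; skel_to_c is a hypothetical name and the sample input is invented:

```sh
# Hypothetical helper: turn text on stdin into a C string array, POSIX awk only.
skel_to_c() {
    awk '
    BEGIN { print "const char *skel[] = {" }
    {
        sub(/\r$/, "")       # drop the CR of a CRLF ending
        gsub(/\\/, "&&")     # double each backslash for the C literal
        gsub(/"/, "\\\"")    # escape double quotes
        printf "\t\"%s\",\n", $0
    }
    END { print "\t0"; print "};" }
    '
}

printf 'say "hi"\r\na\\b\n' | skel_to_c
```

This sidesteps the four-argument split(), at the price of not remembering which EOL the input used.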

In reply to jannick0's comment of Saturday, 16 December 2017, 12:28 am +0000.

jannick0 commented 6 years ago

OK, in the package mkskel.zip I tried to put together a POSIX-compliant awk script that should do the trick: the output EOLs are identical to the input file's EOLs unless an EOL is given on the awk command line.

Additional notes:

  • EOL consistency check for the input file (if the EOL is not provided on the command line, i.e. from outside the script).
  • The awk script could replace mkskel.sh, and thus could make m4 obsolete for the preprocessing step. For this, the only m4preproc define, M4_GEN_PREFIX, is migrated to an awk function. The VERSION number is mandatory on the awk command line.
  • POSIX compliance checked with gawk --posix (or gawk -P).
  • The package contains a makefile to check for differences between the output of mkskel.sh and mkskel.awk after running both against flex.skl. The only differences I see are the additional header line with the date stamp and quotation issues in C comments, thus effectively no differences with an impact on flex code.
  • The current version of the script processes flex.skl as it stands right now. TODOs in the script indicate where code could be removed or amended if corresponding changes were applied to flex.skl; I would expect this to shrink the code quite a bit.
  • The output file type is governed by the version of awk used, which I think is not important here, since C compilers do not care about the nasty EOL issue, I would hope.

@westes ... and as always please do feel free to amend as you find appropriate. But I hope that helps.

westes commented 6 years ago

Thanks. I'll have a look. Most likely after 2.6.5 is released, which is next on my flex todo list, but we'll see how things go.

In reply to jannick0's comment of Sunday, 17 December 2017, 8:00 am -0800.

Explorer09 commented 6 years ago

Excuse me, but what was this issue about? Was it only about the [^\r] incompatibility, or was it something more? I think the fix should be easy, with no need to bother with awk or perl. Since I experimented with sed syntax while working on PR #321, I think I can take this one.

But here's one thing I need to know first: which EOL (end of line) convention are we expecting for flex.skl? LF only, CR+LF, or CR, or do we accept all three?
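If the problem is only the [^\r] bracket expression, one portable spelling is to put a real CR byte into the pattern via the shell. This is a sketch of that idea, not the change that was eventually made to mkskel.sh; before_cr is a hypothetical helper:

```sh
# Hypothetical helper: keep only the text before the first CR on each line.
before_cr() {
    cr=$(printf '\r')
    # [^${cr}] is the portable spelling of [^\r]: the bracket expression
    # holds a real CR byte, so sed never has to interpret a backslash escape.
    sed "s/^\([^${cr}]*\).*\$/\1/"
}

printf 'keep\rdrop\n' | before_cr
```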

westes commented 6 years ago

In theory we accept any line termination at all.

In practice, flex is built in an Ubuntu container (although at some point I'll get the build to run in an OS X container as well, because Travis offers that feature). The *BSD folks who are also contributors to flex use standard LF line termination.

In reply to the comment by Kang-Che Sung (宋岡哲) of Monday, 23 April 2018, 6:15 pm -0700.

wendajiang commented 2 years ago

Maybe this is related to issue #539.