wmaier / treetools

Tools for processing treebank trees
GNU General Public License v3.0

crash with "cannot write a discontinuous trees with brackets." #1

Closed bnicenboim closed 9 years ago

bnicenboim commented 9 years ago

Really cool program! The only problem is that it crashes when it can't parse a tree. It would be better if it could just skip the illegal trees (unless there's a way to fix them).

$ ./treetools transform smultron4.0/es/smultron_es_squoia_ahk.xml  smultron_es_squoia_ahk.penn  --src-format tigerxml --dest-format brackets
reading from 'smultron4.0/es/smultron_es_squoia_ahk.xml' in format 'tigerxml' and encoding 'utf-8'
writing to 'smultron_es_squoia_ahk.penn' in format 'brackets' and encoding 'utf-8'
applying []
smultron4.0/es/smultron_es_squoia_ahk.xml --> smultron_es_squoia_ahk.penn
parsing xml...
reading sentences
Traceback (most recent call last):
  File "./treetools", line 27, in <module>
    main()
  File "./treetools", line 24, in main
    args.func(args)
  File "/home/bruno/Documents/Linguistics/Phd/myPapers/crossling/treetools-master/trees/transform.py", line 883, in run
    (args.dest_opts))
  File "/home/bruno/Documents/Linguistics/Phd/myPapers/crossling/treetools-master/trees/treeoutput.py", line 200, in brackets
    raise ValueError("cannot write a discontinuous trees with brackets.")
ValueError: cannot write a discontinuous trees with brackets.

I think that this is the illegal tree:

<s id="s76"><!-- encendiendo y apagando: CS ? -->
<graph root="s76_512"><terminals><t id="s76_1" word="Puede" pos="VM" morph="--" lemma=""/><t id="s76_2" word="determinar" pos="VM" morph="--" lemma=""/><t id="s76_3" word="si" pos="CjS" morph="--" lemma=""/><t id="s76_4" word="este" pos="DD" morph="--" lemma=""/><t id="s76_5" word="equipo" pos="NC" morph="--" lemma=""/><t id="s76_6" word="causa" pos="VM" morph="--" lemma=""/><t id="s76_7" word="interferencias" pos="NC" morph="--" lemma=""/><t id="s76_8" word="perjudiciales" pos="AQ" morph="--" lemma=""/><t id="s76_9" word="para" pos="SP" morph="--" lemma=""/><t id="s76_10" word="la" pos="DA" morph="--" lemma=""/><t id="s76_11" word="recepción" pos="NC" morph="--" lemma=""/><t id="s76_12" word="de" pos="SP" morph="--" lemma=""/><t id="s76_13" word="radio" pos="NC" morph="--" lemma=""/><t id="s76_14" word="o" pos="CC" morph="--" lemma=""/><t id="s76_15" word="televisión" pos="NC" morph="--" lemma=""/><t id="s76_16" word="encendiendo" pos="VM" morph="--" lemma=""/><t id="s76_17" word="y" pos="CC" morph="--" lemma=""/><t id="s76_18" word="apagando" pos="VM" morph="--" lemma=""/><t id="s76_19" word="el" pos="DA" morph="--" lemma=""/><t id="s76_20" word="equipo" pos="NC" morph="--" lemma=""/><t id="s76_21" word=";" pos="F" morph="--" lemma=""/></terminals><nonterminals><nt id="s76_500" cat="NP"><edge label="--" idref="s76_4"/><edge label="--" idref="s76_5"/></nt><nt id="s76_501" cat="AP"><edge label="--" idref="s76_8"/></nt><nt id="s76_502" cat="NP"><edge label="--" idref="s76_13"/></nt><nt id="s76_503" cat="NP"><edge label="--" idref="s76_15"/></nt><nt id="s76_504" cat="NP"><edge label="--" idref="s76_19"/><edge label="--" idref="s76_20"/></nt><nt id="s76_505" cat="NP"><edge label="--" idref="s76_7"/><edge label="--" idref="s76_501"/></nt><nt id="s76_506" cat="CNP"><edge label="--" idref="s76_14"/><edge label="--" idref="s76_502"/><edge label="--" idref="s76_503"/></nt><nt id="s76_507" cat="CS"><edge label="--" idref="s76_16"/><edge label="--" idref="s76_17"/><edge 
label="--" idref="s76_18"/><edge label="CD" idref="s76_504"/></nt><nt id="s76_508" cat="PP"><edge label="--" idref="s76_12"/><edge label="--" idref="s76_506"/></nt><nt id="s76_509" cat="NP"><edge label="--" idref="s76_10"/><edge label="--" idref="s76_11"/><edge label="--" idref="s76_508"/></nt><nt id="s76_510" cat="PP"><edge label="--" idref="s76_9"/><edge label="--" idref="s76_509"/></nt><nt id="s76_511" cat="S"><edge label="--" idref="s76_3"/><edge label="--" idref="s76_6"/><edge label="SUJ" idref="s76_500"/><edge label="CD" idref="s76_505"/><edge label="CI" idref="s76_510"/></nt><nt id="s76_512" cat="S"><edge label="--" idref="s76_1"/><edge label="--" idref="s76_2"/><edge label="--" idref="s76_21"/><edge label="CC" idref="s76_507"/><edge label="CD" idref="s76_511"/></nt></nonterminals></graph></s>

I'm new to treebanks, so I'm not sure what exactly is wrong. I don't know whether the tree is "fixable" or whether it should be ignored.

Bests! Bruno

wmaier commented 9 years ago

Thanks for your report. Gentler failure strategies are on my todo list! The error you saw means that there appear to be crossing branches somewhere in the sentence annotation, which cannot be rendered as brackets. Unfortunately, I cannot reproduce this with the sentence you pasted. Here, it converts just fine to

(VROOT(S(VM Puede)(VM determinar)(S(CjS si)(NP(DD este)(NC equipo))(VM causa)(NP(NC interferencias)(AP(AQ perjudiciales)))(PP(SP para)(NP(DA la)(NC recepción)(PP(SP de)(CNP(NP(NC radio))(CC o)(NP(NC televisión)))))))(CS(VM encendiendo)(CC y)(VM apagando)(NP(DA el)(NC equipo)))(F ;))).

bnicenboim commented 9 years ago

Oh, maybe I pasted the wrong tree. I'll check it again tonight.

bnicenboim commented 9 years ago

Sorry, this is the tree that made the program crash. Is there something I can do, or should I just ignore it?

    <s id="s1437">
      <graph root="s1437_515">
        <terminals>
          <t id="s1437_1" word="Sobre" pos="SP" morph="--"/>
          <t id="s1437_2" word="todo" pos="PI" morph="--"/>
          <t id="s1437_3" word="en" pos="SP" morph="--"/>
          <t id="s1437_4" word="tiempos" pos="NC" morph="--"/>
          <t id="s1437_5" word="difíciles" pos="AQ" morph="--"/>
          <t id="s1437_6" word="la" pos="DA" morph="--"/>
          <t id="s1437_7" word="cooperación" pos="NC" morph="--"/>
          <t id="s1437_8" word="internacional" pos="AQ" morph="--"/>
          <t id="s1437_9" word="asume" pos="VM" morph="--"/>
          <t id="s1437_10" word="un" pos="DI" morph="--"/>
          <t id="s1437_11" word="papel" pos="NC" morph="--"/>
          <t id="s1437_12" word="crucial" pos="AQ" morph="--"/>
          <t id="s1437_13" word="," pos="F" morph="--"/>
          <t id="s1437_14" word="ya" pos="RG" morph="--"/>
          <t id="s1437_15" word="que" pos="CjS" morph="--"/>
          <t id="s1437_16" word="ningún" pos="DI" morph="--"/>
          <t id="s1437_17" word="país" pos="NC" morph="--"/>
          <t id="s1437_18" word="del" pos="SP" morph="--"/>
          <t id="s1437_19" word="mundo" pos="NC" morph="--"/>
          <t id="s1437_20" word="puede" pos="VM" morph="--"/>
          <t id="s1437_21" word="afrontar" pos="VM" morph="--"/>
          <t id="s1437_22" word="la" pos="DA" morph="--"/>
          <t id="s1437_23" word="crisis" pos="NC" morph="--"/>
          <t id="s1437_24" word="por" pos="SP" morph="--"/>
          <t id="s1437_25" word="sí" pos="PrN" morph="--"/>
          <t id="s1437_26" word="solo" pos="AQ" morph="--"/>
          <t id="s1437_27" word="." pos="F$" morph="--"/>
        </terminals>
        <nonterminals>
          <nt id="s1437_500" cat="NP">
            <edge label="--" idref="s1437_2"/>
          </nt>
          <nt id="s1437_501" cat="NP">
            <edge label="--" idref="s1437_4"/>
            <edge label="--" idref="s1437_516"/>
          </nt>
          <nt id="s1437_502" cat="NP">
            <edge label="--" idref="s1437_6"/>
            <edge label="--" idref="s1437_7"/>
            <edge label="--" idref="s1437_517"/>
          </nt>
          <nt id="s1437_503" cat="NP">
            <edge label="--" idref="s1437_10"/>
            <edge label="--" idref="s1437_11"/>
            <edge label="--" idref="s1437_518"/>
          </nt>
          <nt id="s1437_504" cat="MTC">
            <edge label="--" idref="s1437_14"/>
            <edge label="--" idref="s1437_15"/>
          </nt>
          <nt id="s1437_505" cat="NP">
            <edge label="--" idref="s1437_19"/>
          </nt>
          <nt id="s1437_506" cat="NP">
            <edge label="--" idref="s1437_22"/>
            <edge label="--" idref="s1437_23"/>
          </nt>
          <nt id="s1437_507" cat="AP">
            <edge label="--" idref="s1437_26"/>
          </nt>
          <nt id="s1437_508" cat="PP">
            <edge label="--" idref="s1437_1"/>
            <edge label="--" idref="s1437_500"/>
          </nt>
          <nt id="s1437_509" cat="PP">
            <edge label="--" idref="s1437_18"/>
            <edge label="--" idref="s1437_505"/>
          </nt>
          <nt id="s1437_510" cat="NP">
            <edge label="--" idref="s1437_25"/>
            <edge label="--" idref="s1437_507"/>
          </nt>
          <nt id="s1437_511" cat="PP">
            <edge label="--" idref="s1437_3"/>
            <edge label="--" idref="s1437_519"/>
          </nt>
          <nt id="s1437_512" cat="NP">
            <edge label="--" idref="s1437_16"/>
            <edge label="--" idref="s1437_17"/>
            <edge label="--" idref="s1437_509"/>
          </nt>
          <nt id="s1437_513" cat="PP">
            <edge label="--" idref="s1437_24"/>
            <edge label="--" idref="s1437_510"/>
          </nt>
          <nt id="s1437_514" cat="S">
            <edge label="--" idref="s1437_20"/>
            <edge label="--" idref="s1437_21"/>
            <edge label="--" idref="s1437_504"/>
            <edge label="CD" idref="s1437_506"/>
            <edge label="SUJ" idref="s1437_512"/>
            <edge label="CC" idref="s1437_513"/>
          </nt>
          <nt id="s1437_515" cat="S">
            <edge label="--" idref="s1437_9"/>
            <edge label="--" idref="s1437_13"/>
            <edge label="--" idref="s1437_27"/>
            <edge label="SUJ" idref="s1437_502"/>
            <edge label="CD" idref="s1437_503"/>
            <edge label="CCT" idref="s1437_511"/>
            <edge label="AO" idref="s1437_514"/>
          </nt>
          <nt id="s1437_516" cat="AP">
            <edge label="--" idref="s1437_5"/>
          </nt>
          <nt id="s1437_517" cat="AP">
            <edge label="--" idref="s1437_8"/>
          </nt>
          <nt id="s1437_518" cat="AP">
            <edge label="--" idref="s1437_12"/>
          </nt>
          <nt id="s1437_519" cat="NP">
            <edge label="--" idref="s1437_501"/>
            <edge label="--" idref="s1437_508"/>
          </nt>
        </nonterminals>
      </graph>
    </s>

wmaier commented 9 years ago

The problem is the node with id s1437_511, which immediately dominates s1437_3 and s1437_519. s1437_519, however, dominates terminals to the left of s1437_3 (s1437_1 and s1437_2) as well as terminals to its right (s1437_4 and s1437_5). This results in crossing branches, which cannot be represented in the standard bracketing format. If you do not want to deal with crossing branches, you will have to either omit this tree or resolve them. In this tree, you would have to attach, e.g., s1437_508 to s1437_519.

For the moment I have added an option to skip discontinuous trees during brackets output (instead of failing). Use --dest-opts brackets_skipdisco.
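
The crossing-branch diagnosis above can also be checked mechanically. The following is a minimal sketch, not part of treetools: it reads one TigerXML `<s>` element, collects the terminals each nonterminal dominates, and flags every node whose terminals do not form one unbroken span. The function name `discontinuous_nodes` and the `SAMPLE` string are made up for illustration; `SAMPLE` is a reduced excerpt of sentence s1437 (first five terminals only, with s1437_511 as root).

```python
import xml.etree.ElementTree as ET

# Reduced excerpt of sentence s1437 from this thread (illustration only).
SAMPLE = """
<s id="s1437">
  <graph root="s1437_511">
    <terminals>
      <t id="s1437_1" word="Sobre" pos="SP"/>
      <t id="s1437_2" word="todo" pos="PI"/>
      <t id="s1437_3" word="en" pos="SP"/>
      <t id="s1437_4" word="tiempos" pos="NC"/>
      <t id="s1437_5" word="difíciles" pos="AQ"/>
    </terminals>
    <nonterminals>
      <nt id="s1437_500" cat="NP"><edge label="--" idref="s1437_2"/></nt>
      <nt id="s1437_501" cat="NP"><edge label="--" idref="s1437_4"/><edge label="--" idref="s1437_516"/></nt>
      <nt id="s1437_508" cat="PP"><edge label="--" idref="s1437_1"/><edge label="--" idref="s1437_500"/></nt>
      <nt id="s1437_511" cat="PP"><edge label="--" idref="s1437_3"/><edge label="--" idref="s1437_519"/></nt>
      <nt id="s1437_516" cat="AP"><edge label="--" idref="s1437_5"/></nt>
      <nt id="s1437_519" cat="NP"><edge label="--" idref="s1437_501"/><edge label="--" idref="s1437_508"/></nt>
    </nonterminals>
  </graph>
</s>
"""


def discontinuous_nodes(sentence_xml):
    """Return the ids of nonterminals whose terminal yield has a gap."""
    s = ET.fromstring(sentence_xml.strip())
    # Position of each terminal in sentence order.
    position = {t.get("id"): i for i, t in enumerate(s.iter("t"))}
    # Children of each nonterminal, taken from its <edge> elements.
    children = {
        nt.get("id"): [e.get("idref") for e in nt.iter("edge")]
        for nt in s.iter("nt")
    }

    def terminal_yield(node_id):
        # A terminal yields its own position; a nonterminal yields the
        # union of its children's yields.
        if node_id in position:
            return {position[node_id]}
        covered = set()
        for child in children[node_id]:
            covered |= terminal_yield(child)
        return covered

    flagged = []
    for node_id in children:
        covered = terminal_yield(node_id)
        # A contiguous span of n terminals satisfies max - min + 1 == n.
        if covered and max(covered) - min(covered) + 1 != len(covered):
            flagged.append(node_id)
    return flagged


print(discontinuous_nodes(SAMPLE))  # ['s1437_519']
```

On this excerpt the check flags s1437_519: its terminals are s1437_1, s1437_2, s1437_4 and s1437_5, while s1437_3 in between hangs directly under s1437_511. Note that in the XML as pasted, s1437_508 is already listed as a child of s1437_519. As one possible resolution (an assumption, not something confirmed in this thread), moving the s1437_508 edge up so that it sits directly under s1437_511 leaves every node with a contiguous span, and the function then returns an empty list.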