GoogleCodeExporter closed this issue 8 years ago
After I remove all non-alphanumeric characters (except spaces), the data set is
read in fine.
It seems like there are some characters (` maybe?) that it can't handle. Is
there a list of unsupported characters somewhere? Does this sound right to
you?
Original comment by samuel.m...@gmail.com
on 17 Jul 2013 at 11:12
Can you print out the exact line that's failing? I don't know why it wouldn't
handle special characters.
Original comment by adpa...@google.com
on 17 Jul 2013 at 11:35
Do you mean the line in my ARPA file? I don't know, that's part of my problem.
Original comment by samuel.m...@gmail.com
on 17 Jul 2013 at 11:36
Here's the ARPA file that's not working.
Original comment by samuel.m...@gmail.com
on 17 Jul 2013 at 11:37
At edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:172),
add a print statement that prints |line| if the number of spaces in |line|
is less than |ngram.length|.
Original comment by adpa...@google.com
on 17 Jul 2013 at 11:49
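If patching the reader is inconvenient, a similar check can be run on the ARPA file from outside. The following is only a rough sketch, not code from berkeleylm: the class ArpaLineCheck and its methods are hypothetical names, and it assumes the common layout where each n-gram line is "logprob<TAB>words<TAB>backoff" with the words separated by single spaces. It reports every line whose word count does not match the order of the section it sits in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone checker; not part of berkeleylm.
class ArpaLineCheck {

    // Returns the n-gram lines whose word count differs from the order of
    // the section they appear in (e.g. two words inside "\1-grams:").
    static List<String> findMismatchedLines(List<String> arpaLines) {
        List<String> bad = new ArrayList<>();
        int order = 0; // 0 means we are not inside an n-gram section
        for (String line : arpaLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.startsWith("\\")) {
                // "\2-grams:" sets the order; "\data\" and "\end\" reset it.
                order = parseOrder(trimmed);
                continue;
            }
            if (order == 0) {
                continue; // header lines such as "ngram 1=3"
            }
            // Assumed layout: logprob <TAB> words [<TAB> backoff]
            String[] fields = trimmed.split("\t");
            if (fields.length < 2) {
                bad.add(line);
                continue;
            }
            int wordCount = fields[1].trim().split(" +").length;
            if (wordCount != order) {
                bad.add(line);
            }
        }
        return bad;
    }

    // Extracts N from a marker like "\N-grams:"; returns 0 for anything else.
    static int parseOrder(String marker) {
        int dash = marker.indexOf("-grams:");
        if (dash < 0) {
            return 0;
        }
        try {
            return Integer.parseInt(marker.substring(1, dash));
        } catch (NumberFormatException e) {
            return 0;
        }
    }
}
```

Feeding it the lines of the file (for example from java.nio.file.Files.readAllLines) and printing the result should point at the offending entries. If the file separates fields with spaces instead of tabs, the split would need to be adjusted.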
Ah, I figured it out. My parser was giving me "words" that still had spaces in
them, so I would write a unigram that the reader was interpreting as a bigram.
(I am using Stanford's NLP parser to parse files into sentences.)
Your hint about spaces helped. Thanks!
Original comment by samuel.m...@gmail.com
on 18 Jul 2013 at 3:20
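For anyone hitting the same thing, one way to guard against it is to clean the tokens before they are counted or written out. The sketch below is an assumption about how that might look, not what was actually done in this thread: TokenSanitizer and its methods are made-up names, and treating the non-breaking space (U+00A0) as whitespace is a precaution, not something confirmed above. It shows two options: split a token with embedded whitespace into separate words, or keep it as one word by joining the pieces with an underscore.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of berkeleylm or the Stanford parser.
class TokenSanitizer {

    // Whitespace as seen by the regex engine, plus the non-breaking space,
    // which "\\s" does not match by default.
    private static final String SPACE_RUN = "[\\s\\u00A0]+";

    // Option 1: split any token that contains whitespace into separate words.
    static List<String> splitOnWhitespace(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String piece : token.split(SPACE_RUN)) {
                if (!piece.isEmpty()) {
                    out.add(piece);
                }
            }
        }
        return out;
    }

    // Option 2: keep the token as a single word by replacing each run of
    // internal whitespace with an underscore.
    static String joinWithUnderscore(String token) {
        return String.join("_", splitOnWhitespace(List.of(token)));
    }
}
```

Which option is appropriate depends on whether something like "New York" should count as one vocabulary entry or two; either way, no word written to the ARPA file contains a space afterwards.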
Original comment by adpa...@gmail.com
on 18 Jul 2013 at 3:30
Original issue reported on code.google.com by
samuel.m...@gmail.com
on 17 Jul 2013 at 10:48