Bytes IT Community

Parser Generator?

P: n/a
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Aug 18 '07 #1
20 Replies


Jack wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
There are several options. I personally like spark.py, the most common
answer is pyparsing, and don't forget to check out NLTK, the natural
language toolkit.

Diez
Aug 18 '07 #2

On Aug 18, 5:22 pm, "Jack" <nos...@invalid.com> wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Antlr seems to be able to generate Python code, too.

Aug 19 '07 #3


On 19 aug 2007, at 00.22, Jack wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Antlr can generate Python code.
However, I don't think a parser generator is suitable for generating
natural language parsers; they are intended to generate code for
computer language parsers. For examples of parsing imperative English
sentences, I suggest taking a look at the class library for TADS 3
(Text Adventure Development System) <http://www.tads.org>
The language has a syntax reminiscent of C++ and Java.
-----------------------------------------------------
An astronomer to a colleague:
-I can't understand how you can go to the brothel as often as you
do. Not only is it a filthy habit, but it must cost a lot of money too.
-That's no problem. I've got a big government grant for the study of
black holes.
Tommy Nordgren
to************@comhem.se

Aug 19 '07 #4

Thanks for all the replies!

SPARK looks promising. Its documentation doesn't say whether it
handles Unicode (CJK in particular), though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

Jack
"Jack" <no****@invalid.com> wrote in message
news:ab******************************@comcast.com...
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

Aug 19 '07 #5

Jack wrote:
Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.
I'm going to echo Tommy's reply. If you want to parse natural language,
conventional parsers are going to be worse than useless (because you'll
keep thinking, "Just one more tweak and this time it'll work for
sure!"). Instead, go look at what the interactive fiction community
uses. They analyse the statement in multiple passes, first picking out
the verbs, then the noun phrases. Some of their parsers can do
on-the-fly domain-specific spelling correction, etc, and all of them can
ask the user for clarification. (I'm currently cobbling together
something similar for pre-teen users.)
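The multi-pass approach described above can be roughly sketched in plain Python; the verb vocabulary and the simple verb/noun split below are illustrative assumptions, not code from any actual IF system:

```python
# Toy sketch of a two-pass command parser: pick out the verb first,
# then treat the remaining words as the noun phrase.
# The vocabulary is a made-up example.
VERBS = {'take', 'open', 'go', 'look'}

def parse_command(text):
    words = text.lower().split()
    verb = next((w for w in words if w in VERBS), None)   # pass 1: the verb
    nouns = [w for w in words if w not in VERBS]          # pass 2: noun phrase
    return verb, nouns

print(parse_command("take the brass key"))
# -> ('take', ['the', 'brass', 'key'])
```

A real IF parser would of course also handle multiple objects, prepositions, and clarification questions; this only shows the pass structure.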
Aug 19 '07 #6

Thanks for the suggestion. I understand that more work is needed for
natural language understanding. What I want to do is actually very
simple - I pre-screen the user-typed text. If it's a simple syntax my
code understands, like "Weather in London", I'll redirect it to a
weather site. Or, if it's "What is ...", I'll probably redirect it to
Wikipedia. Otherwise, I'll throw it to a search engine. So, extremely
simple stuff ...
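The pre-screening described above can be sketched in a few lines of Python; the patterns and routing targets here are illustrative guesses, not Jack's actual code:

```python
import re

def dispatch(query):
    """Route a user query to a handler; the patterns are illustrative."""
    m = re.match(r"weather (?:of|in) (.+)", query, re.IGNORECASE)
    if m:
        return ("weather", m.group(1))       # e.g. send to a weather site
    if query.lower().startswith("what is "):
        return ("wikipedia", query[8:])      # e.g. send to Wikipedia
    return ("search", query)                 # fall back to a search engine

print(dispatch("Weather in London"))   # -> ('weather', 'London')
print(dispatch("What is the time"))    # -> ('wikipedia', 'the time')
```

For a fixed handful of patterns like this, plain regular expressions may well be enough, with a parser library only becoming worthwhile as the grammar grows.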

"samwyse" <de******@email.com> wrote in message
news:xH****************@nlpi068.nbdc.sbc.com...
Jack wrote:
>Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

I'm going to echo Tommy's reply. If you want to parse natural language,
conventional parsers are going to be worse than useless (because you'll
keep thinking, "Just one more tweak and this time it'll work for sure!").
Instead, go look at what the interactive fiction community uses. They
analyse the statement in multiple passes, first picking out the verbs,
then the noun phrases. Some of their parsers can do on-the-fly
domain-specific spelling correction, etc, and all of them can ask the user
for clarification. (I'm currently cobbling together something similar for
pre-teen users.)

Aug 19 '07 #7

Jack <no****@invalid.com> wrote:
Thanks for the suggestion. I understand that more work is needed for
natural language understanding. What I want to do is actually very
simple - I pre-screen the user-typed text. If it's a simple syntax my
code understands, like "Weather in London", I'll redirect it to a
weather site. Or, if it's "What is ...", I'll probably redirect it to
Wikipedia. Otherwise, I'll throw it to a search engine. So, extremely
simple stuff ...
<http://nltk.sourceforge.net/index.php/Main_Page>

"""
NLTK — the Natural Language Toolkit — is a suite of open source Python
modules, data sets and tutorials supporting research and development in
natural language processing.
"""
Alex
Aug 19 '07 #8

Very interesting work. Thanks for the link!

"Alex Martelli" <al***@mac.com> wrote in message
news:1i**************************@mac.com...
<http://nltk.sourceforge.net/index.php/Main_Page>

"""
NLTK — the Natural Language Toolkit — is a suite of open source Python
modules, data sets and tutorials supporting research and development in
natural language processing.
"""
Alex

Aug 20 '07 #9

On Aug 18, 3:22 pm, "Jack" <nos...@invalid.com> wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
I use Parsing.py. I like it a lot, probably because I wrote it.

http://www.canonware.com/Parsing/

Jason

Aug 23 '07 #10

Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.

"Jason Evans" <jo*****@gmail.com> wrote in message
news:11**********************@e9g2000prf.googlegroups.com...
On Aug 18, 3:22 pm, "Jack" <nos...@invalid.com> wrote:
>Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

I use Parsing.py. I like it a lot, probably because I wrote it.

http://www.canonware.com/Parsing/

Jason

Aug 24 '07 #11

On Aug 18, 11:37 pm, "Jack" <nos...@invalid.com> wrote:
Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy: http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

Jack

"Jack" <nos...@invalid.com> wrote in message

news:ab******************************@comcast.com...
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.
In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Jack -

Pyparsing was already mentioned once on this thread. Here is an
application using pyparsing that parses Chinese characters to convert
to English Python.

http://pypi.python.org/pypi/zhpy/0.5

-- Paul

Aug 25 '07 #12

On Aug 24, 1:21 pm, "Jack" <nos...@invalid.com> wrote:
"Jason Evans" <joev...@gmail.com> wrote in message
http://www.canonware.com/Parsing/

Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.
Parsers typically deal with tokens rather than individual characters,
so the scanner that creates the tokens is the main thing that Unicode
matters to. I have written Unicode-aware scanners for use with
Parsing-based parsers, with no problems. This is pretty easy to do,
since Python has built-in support for Unicode strings.
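A scanner of the kind described above can be sketched with nothing but built-in string methods (written for modern Python; the character classification here is a deliberate simplification):

```python
def scan(text):
    """Split text into word and punctuation tokens.

    Note: str.isalnum() is True for CJK characters too, so runs of
    CJK text come out as single tokens here - a real scanner would
    need to segment them further.
    """
    tokens, buf = [], []
    for ch in text:
        if ch.isspace():
            if buf:
                tokens.append(''.join(buf))
                buf = []
        elif ch.isalnum():
            buf.append(ch)
        else:
            if buf:
                tokens.append(''.join(buf))
                buf = []
            tokens.append(ch)  # punctuation is its own token
    if buf:
        tokens.append(''.join(buf))
    return tokens

print(scan("weather of 北京?"))
# -> ['weather', 'of', '北京', '?']
```

The parser proper then only ever sees the token list, which is why Unicode support is mostly a scanner concern.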

Jason

Aug 26 '07 #13

Good to know, thanks Paul.
"Paul McGuire" <pt***@austin.rr.com> wrote in message
Pyparsing was already mentioned once on this thread. Here is an
application using pyparsing that parses Chinese characters to convert
to English Python.

http://pypi.python.org/pypi/zhpy/0.5

-- Paul

Aug 26 '07 #14

Thanks Jason. There seem to be a few options that I can pursue. Having
a hard time choosing one now :)

"Jason Evans" <jo*****@gmail.com> wrote in message
news:11**********************@o80g2000hse.googlegroups.com...
On Aug 24, 1:21 pm, "Jack" <nos...@invalid.com> wrote:
>"Jason Evans" <joev...@gmail.com> wrote in message
http://www.canonware.com/Parsing/

Thanks Jason. Does Parsing.py support Unicode characters (especially
CJK)?
I'll take a look.

Parsers typically deal with tokens rather than individual characters,
so the scanner that creates the tokens is the main thing that Unicode
matters to. I have written Unicode-aware scanners for use with
Parsing-based parsers, with no problems. This is pretty easy to do,
since Python has built-in support for Unicode strings.

Jason

Aug 26 '07 #15

On Behalf Of Jason Evans
Parsers typically deal with tokens rather than individual
characters, so the scanner that creates the tokens is the
main thing that Unicode matters to. I have written
Unicode-aware scanners for use with Parsing-based parsers,
with no problems. This is pretty easy to do, since Python
has built-in support for Unicode strings.
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom

Aug 27 '07 #16

On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
On Behalf Of Jason Evans
Parsers typically deal with tokens rather than individual
characters, so the scanner that creates the tokens is the
main thing that Unicode matters to. I have written
Unicode-aware scanners for use with Parsing-based parsers,
with no problems. This is pretty easy to do, since Python
has built-in support for Unicode strings.

The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom
Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y','=','a','*','x','**','2','+','b','*','x','+','c']

even though there is not a single delimiting space. But pyparsing
will also render this as a nested parse tree, reflecting the
precedence of operations:

['y', '=', [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+',
'c']]

and will allow you to access individual tokens by field name:
- lhs: y
- rhs: [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']

Please feel free to look through the posted examples on the pyparsing
wiki at http://pyparsing.wikispaces.com/Examples, or some of the
applications currently using pyparsing at http://pyparsing.wikispaces.com/WhosUsingPyparsing,
and you might get a better feel for what kind of tasks pyparsing is
capable of.
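For comparison, the same whitespace-free tokenization of that expression can be sketched with the standard library alone (a longest-match-first regex; this is only an illustration, not how pyparsing works internally):

```python
import re

# Alternatives are tried left to right, so '**' must appear
# before the single-character operators in the pattern.
TOKEN = re.compile(r"\*\*|[A-Za-z_]\w*|\d+|[=+\-*/()]")

def tokenize(expr):
    return TOKEN.findall(expr)

print(tokenize("y=a*x**2+b*x+c"))
# -> ['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']
```

This works because each token class is distinguishable by its characters, which is exactly the property the later posts point out is missing in Chinese and Japanese text.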

-- Paul

Aug 27 '07 #17

Paul McGuire wrote:
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
>The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y','=','a','*','x','**','2','+','b','*','x','+','c']
The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

STeVe
Aug 27 '07 #18

On Behalf Of Paul McGuire
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the
text through a tokenizer (like ChaSen for Japanese) before using
PyParsing.

Did you think pyparsing is so mundane as to require spaces
between tokens? Pyparsing has been doing this type of
token-recognition since Day 1.
Cool! I stand happily corrected. I did write "I think" because although I
couldn't find a way to do it, there might well actually be one <g>. I'll
keep looking to find some examples of parsing Japanese.

BTW, I think PyParsing is great, and I use it for several tasks. I just
could never figure out a way to use it with Japanese (at least on the
applications I had in mind).

Regards,
Ryan Ginstrom

Aug 27 '07 #19

On Aug 26, 10:48 pm, Steven Bethard <steven.beth...@gmail.com> wrote:
Paul McGuire wrote:
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.
Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:
y=a*x**2+b*x+c
as
['y','=','a','*','x','**','2','+','b','*','x','+','c']

The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

STeVe
Steve -

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
'that', 'in', 'python', 'library', 'provides', 'code', 'to']

knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print sentence.parseString( mush )

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the runtogether "into" will always be assumed to be
"into", and never "in to". Fortunately (for pyparsing), your example
was sufficiently friendly as to avoid ambiguities. But if you can
select a suitable vocabulary, even a runon mush is parseable.
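The longest-match behavior, and the "into" pitfall mentioned above, can also be demonstrated with a small greedy segmenter in plain Python; the vocabulary is a made-up example:

```python
def greedy_segment(text, vocab):
    """Greedily split text into known words, longest match first.

    Returns None if some stretch of text matches no known word.
    """
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            return None
    return words

vocab = {'in', 'to', 'into', 'go'}
print(greedy_segment("gointo", vocab))
# -> ['go', 'into'] - never ['go', 'in', 'to']
```

This greedy strategy commits to the longest word at each step and never backtracks, which is exactly why ambiguous vocabularies can produce wrong segmentations.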

-- Paul
Aug 27 '07 #20

Paul McGuire wrote:
On Aug 26, 10:48 pm, Steven Bethard <steven.beth...@gmail.com> wrote:
>In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
'that', 'in', 'python', 'library', 'provides', 'code', 'to']

knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print sentence.parseString( mush )

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the runtogether "into" will always be assumed to be
"into", and never "in to".
Yep, and these kinds of things occur quite frequently with Chinese and
Japanese. The point was not that pyparsing couldn't do it for a small
subset of characters/words, but that pyparsing is probably not the right
solution for general purpose Japanese/Chinese tokenization.

Steve
Aug 27 '07 #21

This discussion thread is closed

Replies have been disabled for this discussion.