
Parser Generator?

Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Aug 18 '07 #1
Jack wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
There are several options. I personally like spark.py, the most common
answer is pyparsing, and don't forget to check out NLTK, the natural
language toolkit.

Diez
Aug 18 '07 #2
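
For a query language this small, a pyparsing grammar fits in a few lines. A
minimal sketch, assuming pyparsing is installed (the keywords and result
names are chosen for illustration, not taken from any of the libraries above):

from pyparsing import CaselessKeyword, Word, alphas, OneOrMore

# Illustrative grammar for two fixed query shapes.
place = OneOrMore(Word(alphas)).setResultsName("place")
weather_query = (CaselessKeyword("weather")
                 + (CaselessKeyword("of") | CaselessKeyword("in"))
                 + place)
time_query = (CaselessKeyword("what") + CaselessKeyword("is")
              + CaselessKeyword("the") + CaselessKeyword("time"))
query = weather_query | time_query

result = query.parseString("weather of London")
print(result.place)    # ['London']
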
On Aug 18, 5:22 pm, "Jack" <nos...@invalid.com> wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Antlr seems to be able to generate Python code, too.

Aug 19 '07 #3

On 19 aug 2007, at 00.22, Jack wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Antlr can generate Python code.
However, I don't think a parser generator is suitable for generating
natural language parsers; they are intended to generate code for
computer language parsers. For examples of parsing imperative English
sentences, I suggest taking a look at the class library for TADS 3
(Text Adventure Development System) <http://www.tads.org>.
The language has a syntax reminiscent of C++ and Java.
-----------------------------------------------------
An astronomer to a colleague:
-I can't understand how you can go to the brothel as often as you
do. Not only is it a filthy habit, but it must cost a lot of money too.
-That's no problem. I've got a big government grant for the study of
black holes.
Tommy Nordgren
to************@comhem.se

Aug 19 '07 #4
Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles Unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy: http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

Jack
"Jack" <no****@invalid.comwrote in message
news:ab******************************@comcast.com. ..
Hi all, I need to do syntax parsing of simple naturual languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

Aug 19 '07 #5
Jack wrote:
Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles Unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.
I'm going to echo Tommy's reply. If you want to parse natural language,
conventional parsers are going to be worse than useless (because you'll
keep thinking, "Just one more tweak and this time it'll work for
sure!"). Instead, go look at what the interactive fiction community
uses. They analyse the statement in multiple passes, first picking out
the verbs, then the noun phrases. Some of their parsers can do
on-the-fly domain-specific spelling correction, etc, and all of them can
ask the user for clarification. (I'm currently cobbling together
something similar for pre-teen users.)
Aug 19 '07 #6
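
The multi-pass scheme samwyse describes can be sketched in plain Python:
pass one scans for a known verb, pass two treats whatever follows as the
noun phrase. A toy illustration (the verb list is invented):

# Toy two-pass parser in the interactive-fiction style.
VERBS = {"take", "drop", "open", "examine"}

def parse_command(text):
    words = text.lower().split()
    for i, word in enumerate(words):
        if word in VERBS:
            # Pass 2: everything after the verb is the noun phrase.
            return word, words[i + 1:]
    return None, words   # no verb found: ask the user for clarification

print(parse_command("please take the brass lantern"))
# ('take', ['the', 'brass', 'lantern'])
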
Thanks for the suggestion. I understand that more work is needed for
natural language understanding. What I want to do is actually very
simple - I pre-screen the user-typed text. If it's a simple syntax my
code understands, like "Weather in London", I'll redirect it to a
weather site. Or, if it's "What is ...", I'll probably redirect it to
Wikipedia. Otherwise, I'll throw it to a search engine. So, extremely
simple stuff ...

"samwyse" <de******@email.comwrote in message
news:xH****************@nlpi068.nbdc.sbc.com...
Jack wrote:
>Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles Unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

I'm going to echo Tommy's reply. If you want to parse natural language,
conventional parsers are going to be worse than useless (because you'll
keep thinking, "Just one more tweak and this time it'll work for sure!").
Instead, go look at what the interactive fiction community uses. They
analyse the statement in multiple passes, first picking out the verbs,
then the noun phrases. Some of their parsers can do on-the-fly
domain-specific spelling correction, etc, and all of them can ask the user
for clarification. (I'm currently cobbling together something similar for
pre-teen users.)

Aug 19 '07 #7
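
That kind of pre-screening is a couple of pattern checks plus a fallback,
which plain regular expressions already handle; no parser generator is
needed. A rough sketch (the routing targets are placeholders):

import re

def route(query):
    # "weather of/in <place>" -> weather site
    m = re.match(r"weather (?:of|in) (.+)$", query, re.IGNORECASE)
    if m:
        return ("weather", m.group(1))
    # "what is ..." -> encyclopedia lookup
    if re.match(r"what is\b", query, re.IGNORECASE):
        return ("wikipedia", query)
    # everything else -> search engine
    return ("search", query)

print(route("Weather in London"))   # ('weather', 'London')
print(route("what is the time"))    # ('wikipedia', 'what is the time')
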
Jack <no****@invalid.com> wrote:
Thanks for the suggestion. I understand that more work is needed for
natural language understanding. What I want to do is actually very
simple - I pre-screen the user-typed text. If it's a simple syntax my
code understands, like "Weather in London", I'll redirect it to a
weather site. Or, if it's "What is ...", I'll probably redirect it to
Wikipedia. Otherwise, I'll throw it to a search engine. So, extremely
simple stuff ...
<http://nltk.sourceforge.net/index.php/Main_Page>

"""
NLTK — the Natural Language Toolkit — is a suite of open source Python
modules, data sets and tutorials supporting research and development in
natural language processing.
"""
Alex
Aug 19 '07 #8
Very interesting work. Thanks for the link!

"Alex Martelli" <al***@mac.comwrote in message
news:1i**************************@mac.com...
<http://nltk.sourceforge.net/index.php/Main_Page>

"""
NLTK — the Natural Language Toolkit — is a suite of open source Python
modules, data sets and tutorials supporting research and development in
natural language processing.
"""
Alex

Aug 20 '07 #9
On Aug 18, 3:22 pm, "Jack" <nos...@invalid.com> wrote:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
I use Parsing.py. I like it a lot, probably because I wrote it.

http://www.canonware.com/Parsing/

Jason

Aug 23 '07 #10
Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.

"Jason Evans" <jo*****@gmail.comwrote in message
news:11**********************@e9g2000prf.googlegro ups.com...
On Aug 18, 3:22 pm, "Jack" <nos...@invalid.com> wrote:
>Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

I use Parsing.py. I like it a lot, probably because I wrote it.

http://www.canonware.com/Parsing/

Jason

Aug 24 '07 #11
On Aug 18, 11:37 pm, "Jack" <nos...@invalid.com> wrote:
Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles Unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy: http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

Jack

"Jack" <nos...@invalid.comwrote in message

news:ab******************************@comcast.com. ..
Hi all, I need to do syntax parsing of simple naturual languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.
In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Jack -

Pyparsing was already mentioned once on this thread. Here is an
application using pyparsing that parses Chinese characters to convert
to English Python.

http://pypi.python.org/pypi/zhpy/0.5

-- Paul

Aug 25 '07 #12
On Aug 24, 1:21 pm, "Jack" <nos...@invalid.com> wrote:
"Jason Evans" <joev...@gmail.com> wrote in message
http://www.canonware.com/Parsing/

Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.
Parsers typically deal with tokens rather than individual characters,
so the scanner that creates the tokens is the main thing that Unicode
matters to. I have written Unicode-aware scanners for use with
Parsing-based parsers, with no problems. This is pretty easy to do,
since Python has built-in support for Unicode strings.

Jason

Aug 26 '07 #13
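
The Unicode-aware scanning Jason describes can be as small as one regular
expression over Unicode strings; under the re.UNICODE flag, \w already
covers CJK ideographs. A sketch (this is not Parsing.py's scanner, just the
general idea):

# -*- coding: utf-8 -*-
import re

# With re.UNICODE, \w matches CJK ideographs as well as Latin letters.
token_re = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def scan(text):
    return token_re.findall(text)

print(scan(u"weather of London?"))
# ['weather', 'of', 'London', '?']
print(scan(u"北京天气"))
# comes back as a single token: the regex alone cannot see word
# boundaries inside a run of CJK characters
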
Good to know, thanks Paul!
"Paul McGuire" <pt***@austin.rr.comwrote in message
Pyparsing was already mentioned once on this thread. Here is an
application using pyparsing that parses Chinese characters to convert
to English Python.

http://pypi.python.org/pypi/zhpy/0.5

-- Paul

Aug 26 '07 #14
Thanks Jason. There seem to be a few options that I can pursue. Having
a hard time choosing one now :)

"Jason Evans" <jo*****@gmail.comwrote in message
news:11**********************@o80g2000hse.googlegr oups.com...
On Aug 24, 1:21 pm, "Jack" <nos...@invalid.comwrote:
>"Jason Evans" <joev...@gmail.comwrote in message
http://www.canonware.com/Parsing/

Thanks Jason. Does Parsing.py support Unicode characters (especially
CJK)?
I'll take a look.

Parsers typically deal with tokens rather than individual characters,
so the scanner that creates the tokens is the main thing that Unicode
matters to. I have written Unicode-aware scanners for use with
Parsing-based parsers, with no problems. This is pretty easy to do,
since Python has built-in support for Unicode strings.

Jason

Aug 26 '07 #15
On Behalf Of Jason Evans
Parsers typically deal with tokens rather than individual
characters, so the scanner that creates the tokens is the
main thing that Unicode matters to. I have written
Unicode-aware scanners for use with Parsing-based parsers,
with no problems. This is pretty easy to do, since Python
has built-in support for Unicode strings.
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom

Aug 27 '07 #16
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
On Behalf Of Jason Evans
Parsers typically deal with tokens rather than individual
characters, so the scanner that creates the tokens is the
main thing that Unicode matters to. I have written
Unicode-aware scanners for use with Parsing-based parsers,
with no problems. This is pretty easy to do, since Python
has built-in support for Unicode strings.

The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom
Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y','=','a','*','x','**','2','+','b','*','x','+','c']

even though there is not a single delimiting space. But pyparsing
will also render this as a nested parse tree, reflecting the
precedence of operations:

['y', '=', [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']]

and will allow you to access individual tokens by field name:
- lhs: y
- rhs: [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']

Please feel free to look through the posted examples on the pyparsing
wiki at http://pyparsing.wikispaces.com/Examples, or some of the
applications currently using pyparsing at http://pyparsing.wikispaces.com/WhosUsingPyparsing,
and you might get a better feel for what kind of tasks pyparsing is
capable of.

-- Paul

Aug 27 '07 #17
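
The nested parse tree and named fields shown above can be reproduced with
pyparsing's infixNotation helper (called operatorPrecedence in 2007-era
releases). A sketch, with the precedence table chosen to match the example:

from pyparsing import Word, alphas, nums, oneOf, opAssoc, infixNotation

identifier = Word(alphas)
operand = identifier | Word(nums)

# Operator levels, tightest-binding first.
rhs = infixNotation(operand, [
    ("**", 2, opAssoc.RIGHT),
    (oneOf("* /"), 2, opAssoc.LEFT),
    (oneOf("+ -"), 2, opAssoc.LEFT),
])

assignment = identifier.setResultsName("lhs") + "=" + rhs.setResultsName("rhs")
result = assignment.parseString("y=a*x**2+b*x+c")
print(result.lhs)   # y
print(result.rhs)   # [['a', '*', ['x', '**', '2']], '+', ['b', '*', 'x'], '+', 'c']
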
Paul McGuire wrote:
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
>The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y','=','a','*','x','**','2','+','b','*','x','+','c']
The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

STeVe
Aug 27 '07 #18
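
A common baseline for the segmentation problem Steve raises is greedy
maximum matching: at each position, take the longest dictionary word that
fits, exactly the approach whose failure modes (the 'into' vs. 'in to'
kind) come up in the following posts. A sketch using Steve's run-together
sentence as a stand-in for CJK text:

def max_match(text, vocabulary, longest=12):
    # Greedy longest-match segmentation: a baseline, not a real segmenter.
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j].lower() in vocabulary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])   # unknown character: emit it by itself
            i += 1
    return words

vocab = set("the pyparsing module provides a library of classes that "
            "client code uses to construct grammar directly in python".split())
mush = ("Thepyparsingmoduleprovidesalibraryofclassesthatclientcode"
        "usestoconstructthegrammardirectlyinPythoncode.")
print(max_match(mush, vocab))
# ['The', 'pyparsing', 'module', 'provides', 'a', 'library', ..., '.']
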
On Behalf Of Paul McGuire
>
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the
text through a tokenizer (like ChaSen for Japanese) before using
PyParsing.

Did you think pyparsing is so mundane as to require spaces
between tokens? Pyparsing has been doing this type of
token-recognition since Day 1.
Cool! I stand happily corrected. I did write "I think" because although I
couldn't find a way to do it, there might well actually be one <g>. I'll
keep looking to find some examples of parsing Japanese.

BTW, I think PyParsing is great, and I use it for several tasks. I just
could never figure out a way to use it with Japanese (at least on the
applications I had in mind).

Regards,
Ryan Ginstrom

Aug 27 '07 #19
On Aug 26, 10:48 pm, Steven Bethard <steven.beth...@gmail.com> wrote:
Paul McGuire wrote:
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.
Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:
y=a*x**2+b*x+c
as
['y','=','a','*','x','**','2','+','b','*','x','+','c']

The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

STeVe
Steve -

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
'that', 'in', 'python', 'library', 'provides', 'code', 'to']

knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print sentence.parseString( mush )

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the runtogether "into" will always be assumed to be
"into", and never "in to". Fortunately (for pyparsing), your example
was sufficiently friendly as to avoid ambiguities. But if you can
select a suitable vocabulary, even a runon mush is parseable.

-- Paul
Aug 27 '07 #20
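
The ['in', 'to', 'into'] ambiguity Paul mentions is easy to see with the
same machinery: oneOf tries the longest alternatives first, so run-together
text never comes back as the two shorter words. A quick sketch:

from pyparsing import OneOrMore, oneOf

word = oneOf(['in', 'to', 'into'], caseless=True)
print(OneOrMore(word).parseString("into"))    # ['into'], never ['in', 'to']
print(OneOrMore(word).parseString("in to"))   # ['in', 'to'] once spaces are present
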
Paul McGuire wrote:
On Aug 26, 10:48 pm, Steven Bethard <steven.beth...@gmail.com> wrote:
>In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
'that', 'in', 'python', 'library', 'provides', 'code', 'to']

knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print sentence.parseString( mush )

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the runtogether "into" will always be assumed to be
"into", and never "in to".
Yep, and these kinds of things occur quite frequently with Chinese and
Japanese. The point was not that pyparsing couldn't do it for a small
subset of characters/words, but that pyparsing is probably not the right
solution for general purpose Japanese/Chinese tokenization.

Steve
Aug 27 '07 #21
