Bytes | Software Development & Data Engineering Community

Parser Generator?

Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Aug 18 '07
Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.

"Jason Evans" <jo*****@gmail. comwrote in message
news:11******** **************@ e9g2000prf.goog legroups.com...
On Aug 18, 3:22 pm, "Jack" <nos...@invalid .comwrote:
>Hi all, I need to do syntax parsing of simple naturual languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

I use Parsing.py. I like it a lot, probably because I wrote it.

http://www.canonware.com/Parsing/

Jason

Aug 24 '07 #11
On Aug 18, 11:37 pm, "Jack" <nos...@invalid.com> wrote:
Thanks for all the replies!

SPARK looks promising. Its documentation doesn't say whether it handles
Unicode (CJK in particular), though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy: http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

Jack

"Jack" <nos...@invalid .comwrote in message

news:ab******** *************** *******@comcast .com...
Hi all, I need to do syntax parsing of simple naturual languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.
In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
Jack -

Pyparsing was already mentioned once on this thread. Here is an
application using pyparsing that parses Chinese characters to convert
to English Python.

http://pypi.python.org/pypi/zhpy/0.5

-- Paul

Aug 25 '07 #12
On Aug 24, 1:21 pm, "Jack" <nos...@invalid.com> wrote:
"Jason Evans" <joev...@gmail.com> wrote in message
http://www.canonware.com/Parsing/

Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.
Parsers typically deal with tokens rather than individual characters,
so the scanner that creates the tokens is the main thing that Unicode
matters to. I have written Unicode-aware scanners for use with
Parsing-based parsers, with no problems. This is pretty easy to do,
since Python has built-in support for Unicode strings.
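To illustrate Jason's point, here is a minimal sketch (my illustration, in modern Python, not code from the thread) of a Unicode-aware scanner: it relies on `unicodedata` categories rather than ASCII character classes, so runs of CJK ideographs group into tokens exactly like runs of Latin letters.

```python
import unicodedata

def scan(text):
    """Group consecutive Unicode letters (category 'L*', which
    includes CJK ideographs) into one token; emit every other
    non-space character as its own token."""
    tokens, run = [], ""
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            run += ch
        else:
            if run:
                tokens.append(run)
                run = ""
            if not ch.isspace():
                tokens.append(ch)
    if run:
        tokens.append(run)
    return tokens

print(scan("weather of London?"))  # ['weather', 'of', 'London', '?']
```

Note this only splits letter runs from punctuation; segmenting a CJK letter run into words is a separate problem, discussed later in the thread.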

Jason

Aug 26 '07 #13
Good to know, thanks Paul.
"Paul McGuire" <pt***@austin.r r.comwrote in message
Pyparsing was already mentioned once on this thread. Here is an
application using pyparsing that parses Chinese characters to convert
to English Python.

http://pypi.python.org/pypi/zhpy/0.5

-- Paul

Aug 26 '07 #14
Thanks Jason. There seem to be a few options that I can pursue. Having a
hard time choosing one now :)

"Jason Evans" <jo*****@gmail. comwrote in message
news:11******** **************@ o80g2000hse.goo glegroups.com.. .
On Aug 24, 1:21 pm, "Jack" <nos...@invalid .comwrote:
>"Jason Evans" <joev...@gmail. comwrote in message
http://www.canonware.com/Parsing/

Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.

Parsers typically deal with tokens rather than individual characters,
so the scanner that creates the tokens is the main thing that Unicode
matters to. I have written Unicode-aware scanners for use with
Parsing-based parsers, with no problems. This is pretty easy to do,
since Python has built-in support for Unicode strings.

Jason

Aug 26 '07 #15
On Behalf Of Jason Evans
Parsers typically deal with tokens rather than individual
characters, so the scanner that creates the tokens is the
main thing that Unicode matters to. I have written
Unicode-aware scanners for use with Parsing-based parsers,
with no problems. This is pretty easy to do, since Python
has built-in support for Unicode strings.
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom

Aug 27 '07 #16
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
On Behalf Of Jason Evans
Parsers typically deal with tokens rather than individual
characters, so the scanner that creates the tokens is the
main thing that Unicode matters to. I have written
Unicode-aware scanners for use with Parsing-based parsers,
with no problems. This is pretty easy to do, since Python
has built-in support for Unicode strings.

The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom
Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']

even though there is not a single delimiting space. But pyparsing
will also render this as a nested parse tree, reflecting the
precedence of operations:

['y', '=', [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']]

and will allow you to access individual tokens by field name:
- lhs: y
- rhs: [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']
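The flat token-splitting step Paul describes doesn't require anything pyparsing-specific; a rough sketch with the standard `re` module (my illustration, not code from the thread) shows how character classes alone separate the tokens of `y=a*x**2+b*x+c` even with no delimiting spaces:

```python
import re

# Order matters in the alternation: '**' must come before the
# single-character operators, or the exponent would split in two.
TOKEN = re.compile(r"\*\*|[A-Za-z_]\w*|\d+|[=+\-*/()]")

def tokenize(expr):
    return TOKEN.findall(expr)

print(tokenize("y=a*x**2+b*x+c"))
# ['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']
```

This yields only the flat token list; building the nested tree that reflects operator precedence is what pyparsing adds on top.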

Please feel free to look through the posted examples on the pyparsing
wiki at http://pyparsing.wikispaces.com/Examples, or some of the
applications currently using pyparsing at http://pyparsing.wikispaces.com/WhosUsingPyparsing,
and you might get a better feel for what kind of tasks pyparsing is
capable of.

-- Paul

Aug 27 '07 #17
Paul McGuire wrote:
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
>The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']
The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.
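For a sense of why plain dictionary lookup falls short of those statistical approaches, here is a naive greedy longest-match segmenter (my sketch, not code from the thread); it happens to work on friendly input, but since it carries no notion of context it cannot resolve the genuinely ambiguous splits Steve describes.

```python
def segment(text, vocab):
    """Greedy longest-match segmentation: at each position, take the
    longest vocabulary word that matches. A naive baseline only --
    it has no context, so ambiguous splits are resolved arbitrarily."""
    maxlen = max(map(len, vocab))
    out, i = [], 0
    while i < len(text):
        for j in range(min(maxlen, len(text) - i), 0, -1):
            if text[i:i + j] in vocab:
                out.append(text[i:i + j])
                i += j
                break
        else:
            out.append(text[i])  # unknown character: emit as-is
            i += 1
    return out

vocab = {"the", "pyparsing", "module", "provides", "a", "library"}
print(segment("thepyparsingmoduleprovidesalibrary", vocab))
# ['the', 'pyparsing', 'module', 'provides', 'a', 'library']
```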

STeVe
Aug 27 '07 #18
On Behalf Of Paul McGuire
>
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the
text through a tokenizer (like ChaSen for Japanese) before using
PyParsing.

Did you think pyparsing is so mundane as to require spaces
between tokens? Pyparsing has been doing this type of
token-recognition since Day 1.
Cool! I stand happily corrected. I did write "I think" because although I
couldn't find a way to do it, there might well actually be one <g>. I'll
keep looking to find some examples of parsing Japanese.

BTW, I think PyParsing is great, and I use it for several tasks. I just
could never figure out a way to use it with Japanese (at least on the
applications I had in mind).

Regards,
Ryan Ginstrom

Aug 27 '07 #19
On Aug 26, 10:48 pm, Steven Bethard <steven.beth...@gmail.com> wrote:
Paul McGuire wrote:
On Aug 26, 8:05 pm, "Ryan Ginstrom" <softw...@ginstrom.com> wrote:
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.
Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:
y=a*x**2+b*x+c
as
['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']

The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

STeVe
Steve -

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
    'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
    'that', 'in', 'python', 'library', 'provides', 'code', 'to']

knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print sentence.parseString( mush )

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the run-together "into" will always be assumed to be
"into", and never "in to". Fortunately (for pyparsing), your example
was sufficiently friendly as to avoid ambiguities. But if you can
select a suitable vocabulary, even a run-on mush is parseable.
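The 'into' ambiguity Paul describes is easy to reproduce with a toy longest-match rule (a standalone sketch, not code from the thread):

```python
# Longest-match over the vocabulary {'in', 'to', 'into'} applied to the
# run-together string "into": the longest candidate always wins, so the
# two-word reading "in" + "to" can never be produced.
vocab = {"in", "to", "into"}
text = "into"
candidates = sorted((w for w in vocab if text.startswith(w)),
                    key=len, reverse=True)
print(candidates[0])  # 'into' -- shadows the two-word reading
```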

-- Paul
Aug 27 '07 #20

This thread has been closed and replies have been disabled. Please start a new discussion.
