emacs lisp as text processing language...

Text Processing with Emacs Lisp

Xah Lee, 2007-10-29

This page gives an outline of how to use emacs lisp to do text
processing, using a specific real-world problem as an example. If you
don't know elisp, first take a gander at Emacs Lisp Basics.

HTML version with links and colors is at:
http://xahlee.org/emacs/elisp_text_processing.html

A separate post following this one has some (i hope) relevant
remarks about Perl and Python.

---------------------------------
THE PROBLEM

----------------
Summary

I want to write an elisp program that processes a list of given files.
Each file is an HTML file. For each file, i want to remove the link to
itself in its page navigation bar. More specifically, each file has
a page navigation bar in this format:

<div class="pages">Goto Page: <a href="1.html">1</a>,
<a href="2.html">2</a>, <a href="3.html">3</a>,
<a href="4.html">4</a>, ...</div>

where the file names and link texts are all arbitrary (not 1, 2, 3 as
shown here). The link to itself needs to be removed.

----------------
Detail

My website has over 3 thousand files; many of the pages belong to a
series. For example, i have an article on Algorithmic Mathematical Art,
which is broken into 3 HTML pages. So, at the bottom of each page, i
have a page navigation bar with code like this:

<div class="pages">Goto Page: <a href="20040113_cmaci_larcu.html">1</a>,
<a href="cmaci_larcu2.html">2</a>, <a href="cmaci_larcu3.html">3</a></div>

In a browser, it would look like this:
[image: the rendered page navigation bar]

Note that the link to the page itself really shouldn't be a link.

There are a total of 134 pages scattered about in various directories
that have this page navigation bar. I need some automated way to
process these files and remove the self-link.

I programmed in perl professionally, full time, from 1998 to 2002.
Typically, for this task in perl (or Python), i'd open each
file, read in the file, then use regex to do the replacement, then
write out the file. For replacements that span several lines, the
regex needs to act on the whole file (as opposed to one line at a
time). The regex can become quite complex or reach its limits. For a
more robust solution, an XML/HTML parser package can be used to read
the file into a structured representation, then process that. Using an
HTML parser is a bit involved. Then, as usual, one may need to create
backups of the original files, and also deal with maintaining the
file's meta info, such as keeping the same permission bits. In summary,
if the particular text processing required is not simple, then the
coding gets fairly complex quickly, even if the job is trivial in
principle.

With emacs lisp, the task is vastly simplified, because emacs reads
a file into its buffer representation. With buffers, one can move a
pointer back and forth, search and delete or insert text arbitrarily,
with emacs lisp's entire suite of functions designed for
processing text, as well as the entire emacs environment that
automatically deals with maintaining files (symbolic links, hard
links, auto-backup system, file meta-info maintenance, file locking,
remote files, etc).
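
As a small illustration of the buffer model described above, here is a
sketch that works on a temporary buffer instead of a real file. The
function name my-demo-buffer-edit and the sample text are made up for
this example; the functions it calls (with-temp-buffer, goto-char,
search-forward, delete-region, insert) are standard elisp.

```elisp
;; A minimal sketch of buffer-style editing. The function name and the
;; sample text are illustrative, not part of the original article.
(defun my-demo-buffer-edit ()
  "Return the text produced by a few point-based edits in a temp buffer."
  (with-temp-buffer
    (insert "Goto Page: one, two, three")
    ;; Move point (the cursor) back to the start of the buffer.
    (goto-char (point-min))
    ;; Search forward; point ends up just after the match.
    (search-forward "two")
    ;; Delete the matched text and insert a replacement at point.
    (delete-region (match-beginning 0) (match-end 0))
    (insert "2")
    ;; Jump to the end of the buffer and append more text.
    (goto-char (point-max))
    (insert "!")
    (buffer-string)))
```

Evaluating (my-demo-buffer-edit) returns "Goto Page: one, 2, three!".
No file handles, no read loop, and no regex over the whole text are
needed; everything is done by moving point and editing around it.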

We proceed to write elisp code to solve this problem.

---------------------------------
SOLUTION

Here are the steps we need to do for each file:

* open the file in a buffer
* move cursor to the page navigation text
* move cursor to the file's own name in that text
* run sgml-delete-tag (removes the link)
* save file
* close buffer

We begin by writing test code to process a single file.

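(defun xx ()
  "Temp. experimental code."
  (interactive)
  (let (fpath fname mybuffer)
    (setq fpath "/Users/xah/test1.html")
    (setq fname (file-name-nondirectory fpath))
    ;; Open the file; html-mode (and sgml-delete-tag) loads automatically.
    (setq mybuffer (find-file fpath))
    ;; Start from the top in case the file was already open.
    (goto-char (point-min))
    ;; Move into the nav bar, then onto this file's own link.
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fname)
    ;; Remove the <a> tag pair enclosing point.
    (sgml-delete-tag 1)
    (save-buffer)
    (kill-buffer mybuffer)))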

First of all, create files test1.html, test2.html, test3.html in a
temp directory for testing this code. Each file will contain this page
navigation line:

<div class="pages">Goto Page: <a href="test1.html">some1</a>, <a
href="test2.html">another</a>, <a href="test3.html">xyz3</a></div>

Note that in actual files, the page-nav string may not be in a single
line.
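
Because the page-nav string may be wrapped across lines, a literal
search-forward for the whole string could miss it. One possible way
around this, shown only as a sketch, is re-search-forward with a
pattern that accepts any whitespace between the pieces. The function
name my-goto-page-nav is hypothetical.

```elisp
;; Sketch: find the page-nav prefix even when it is wrapped across
;; lines. The function name is illustrative, not from the article.
(defun my-goto-page-nav ()
  "Move point just after the page-nav prefix in the current buffer.
Return the new position, or nil if the prefix is not found."
  (goto-char (point-min))
  ;; [ \t\n] matches a space, tab, or newline, so a line break between
  ;; \"Goto\" and \"Page:\" (or inside the div tag) does not matter.
  ;; The final t makes the search return nil instead of signaling.
  (re-search-forward
   "<div[ \t\n]+class=\"pages\">[ \t\n]*Goto[ \t\n]+Page:" nil t))
```

This could stand in for the first search-forward call when the real
files turn out to have the string broken over two lines.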

The elisp code above is fairly simple and self-explanatory. The
file-opening function find-file is documented in the elisp manual
section “Files”. The cursor-moving function search-forward is in
“Searching and Matching”, and the functions for saving and closing a
buffer are in “Buffers”.

Reference: Elisp Manual: Files.

Reference: Elisp Manual: Buffers.

Reference: Elisp Manual: Searching-and-Matching.

The interesting part is calling the function sgml-delete-tag. It is a
function loaded by html-mode (which is automatically loaded when an
html file is opened). What sgml-delete-tag does is delete the tag
that encloses the cursor (both the opening and closing tags will be
deleted). The cursor can be anywhere from the opening angle bracket of
the opening tag to the closing angle bracket of the closing tag. This
sgml-delete-tag function helps us tremendously.
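
To try sgml-delete-tag without touching any file, something like the
following sketch can be evaluated. It assumes the sgml-mode library
that ships with Emacs (it provides both html-mode and sgml-delete-tag);
the helper name my-demo-delete-self-link is made up for illustration.

```elisp
(require 'sgml-mode) ; provides html-mode and sgml-delete-tag

;; Sketch only: the helper name is illustrative, not from the article.
(defun my-demo-delete-self-link (html fname)
  "Return HTML with the tag around the first occurrence of FNAME removed."
  (with-temp-buffer
    (insert html)
    ;; html-mode sets up the syntax table that sgml-delete-tag relies on.
    (html-mode)
    (goto-char (point-min))
    ;; After this search, point sits inside the <a ...> opening tag.
    (search-forward fname)
    ;; Delete that opening tag and its matching closing tag.
    (sgml-delete-tag 1)
    (buffer-string)))
```

Called with the test navigation line and "test1.html", the first link
should come back as the plain text some1, while the link to test2.html
is left alone.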

Now, with the above code, our job is essentially done. All we need to
do now is to feed it a bunch of file paths. First we clean the code up
by writing it to take a path as argument.

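(defun my-modfile-page-tag (fpath)
  "Modify the HTML file at FPATH: unlink its self-link in the page-nav bar."
  (let (fname mybuffer)
    (setq fname (file-name-nondirectory fpath))
    (setq mybuffer (find-file fpath))
    ;; Start from the top in case the file was already open.
    (goto-char (point-min))
    (search-forward "<div class=\"pages\">Goto Page:")
    (search-forward fname)
    ;; Remove the <a> tag pair enclosing point.
    (sgml-delete-tag 1)
    (save-buffer)
    (kill-buffer mybuffer)))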

Then, we test this modified code by evaluating the following code:

(my-modfile-page-tag "/Users/xah/test1.html")
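
One thing to keep in mind: search-forward signals an error when the
text is not found, which would stop a long batch run at the first odd
file. A possible variation, given here only as a sketch, passes the
NOERROR argument and reports success or failure instead. The function
name my-remove-self-link is hypothetical, and it works on the current
buffer so it can be tried without files.

```elisp
(require 'sgml-mode) ; provides sgml-delete-tag

;; Sketch only: the function name is illustrative, not from the article.
(defun my-remove-self-link (fname)
  "Unlink FNAME in the page-nav bar of the current buffer.
Assumes the buffer is in html-mode.  Return t if a link was removed,
nil if the nav bar or FNAME was not found."
  (goto-char (point-min))
  (when (search-forward "<div class=\"pages\">Goto Page:" nil t)
    ;; Limit the second search to the nav bar (when its closing </div>
    ;; is found), so FNAME mentioned elsewhere in the page is ignored.
    (let ((end (save-excursion (search-forward "</div>" nil t))))
      (when (search-forward fname end t)
        (sgml-delete-tag 1)
        t))))
```

Bounding the second search by the closing </div> is a design choice of
this sketch; the article's version searches to the end of the buffer.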

To complete our task, all we have to do now is get the list of files
that contain the page-nav tag and feed them to my-modfile-page-tag.

To generate a list of files, we can simply use unix's “find” and
“grep”, like this:

find . -name "*.html" -exec grep -l '<div class="pages">' {} \;

For each line in the output, we just wrap it in double quotes to
make it a lisp string, possibly also inserting the full path by using
string-rectangle, to construct the following code:

(mapcar (lambda (x) (my-modfile-page-tag x))
        (list
         "/Users/xah/web/3d/viz.html"
         "/Users/xah/web/3d/viz2.html"
         "/Users/xah/web/dinju/Khajuraho.html"
         "/Users/xah/web/dinju/Khajuraho2.html"
         "/Users/xah/web/dinju/Khajuraho3.html"
         ;... 100+ lines
         ))

The mapcar and lambda combination is a lisp idiom for looping through
a list. We evaluate the code and we are all done!
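
As an alternative to pasting the find/grep output by hand, the file
list could also be built inside Emacs. The following is a sketch under
the assumption of Emacs 25.1 or later, where directory-files-recursively
is available; the function name my-list-page-nav-files is made up for
this example.

```elisp
;; Sketch only: the function name is illustrative, not from the article.
(defun my-list-page-nav-files (dir)
  "Return the .html files under DIR that contain the page-nav div."
  (let (result)
    ;; Walk DIR recursively, keeping only names that end in .html.
    (dolist (f (directory-files-recursively dir "\\.html\\'"))
      (with-temp-buffer
        ;; Read the file's text without visiting it in a file buffer.
        (insert-file-contents f)
        ;; The final t makes the search return nil when not found.
        (when (search-forward "<div class=\"pages\">" nil t)
          (push f result))))
    (nreverse result)))
```

The result is a list of full paths, so it can be fed straight to the
function defined earlier, for example:
(mapc 'my-modfile-page-tag (my-list-page-nav-files "/Users/xah/web/"))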

Emacs is beautiful!

(a separate post follows on the relevance of Perl and Python)

Xah
xa*@xahlee.org
http://xahlee.org/

Oct 29 '07 #1
.... continued from previous post.

PS I'm cross-posting this post to perl and python groups because i
find it a little-known fact that emacs lisp's power in the
area of text processing is far beyond Perl's (or Python's).

I worked as a professional perl programmer starting in 1998. I started
to study elisp as a hobby in 2005. (i have used emacs daily since
1998.) It is only today, while i was studying elisp's file and buffer
related functions, that i realized how elisp can be used as a general
text processing language, and in fact is a dedicated language for this
task, with powers quite beyond Perl (or Python, PHP, Ruby, Java, C,
etc).

This realization surprised me, because it is well-known that Perl is
the de facto language for text processing, and emacs lisp for this is
almost unknown (outside of elisp developers). The surprise was
exacerbated by the fact that Emacs Lisp predates Perl: GNU Emacs was
first released in 1985, Perl in 1987. (Albeit Emacs Lisp is not
suitable for writing general applications.)

My study of lisp as a text processing tool today reminds me of an
article i read in 2000: “Ilya Regularly Expresses”, an interview
with Dr Ilya Zakharevich (author of cperl-mode.el and a major
contributor to the Perl language). In the article, he mentioned
something about Perl's lack of text processing primitives that are in
emacs, which i did not fully understand at the time. (i didn't know
elisp at the time.)

The article is at:
http://www.perl.com/lpt/a/2000/09/ilya.html

Here's the relevant excerpt:
«
Let me also mention that classifying the text handling facilities of
Perl as "extremely agile" gives me the willies. Perl's regular
expressions are indeed more convenient than in other languages.
However, the lack of a lot of key text-processing ingredients makes
Perl solutions for many averagely complicated tasks either extremely
slow, or not easier to maintain than solutions in other languages (and
in some cases both).

I wrote a (heuristic-driven) Perlish syntax parser and transformer in
Emacs Lisp, and though Perl as a language is incomparably friendlier
than Lisps, I would not be even able of thinking about rewriting this
tool in Perl: there are just not enough text-handling primitives
hardwired into Perl. I will need to code all these primitives first.
And having these primitives coded in Perl, the solution would turn out
to be (possibly) hundreds times slower than the built-in Emacs
operations.

My current conjecture on why people classify Perl as an agile text-
handler (in addition to obvious traits of false advertisements) is
that most of the problems to handle are more or less trivial ("system
maintenance"-type problems). For such problems Perl indeed shines. But
between having simple solutions for simple problems and having it
possible to solve complicated problems, there is a principle of having
moderately complicated solutions for moderately complicated problems.
There is no reason for Perl to be not capable of satisfying this
requirement, but currently Perl needs improvement in this regard.
»

Xah
xa*@xahlee.org
http://xahlee.org/

Oct 29 '07 #2
