Help | Site Map
Connecting Tech Pros Worldwide
 
 
LinkBack Thread Tools
  #1  
Old August 2nd, 2006, 09:15 PM
Christoph Haas
Guest
 
Posts: n/a
Default Running queries on large data structure

Hi, list...

I have written an application in Perl some time ago (I was young and needed
the money) that parses multiple large text files containing nested data
structures and allows the user to run quick queries on the data.
(For the firewall admins among you: it's a parser and web-based query tool
for CheckPoint firewall rulebases. The user can search for source and
destination IPs and get the matching rules.)

The current application consists of two parts:

(1) An importing process that reads and parses the large text files and
writes the data in different PostgreSQL tables.
(2) A web (CGI) interface that allows the user to query the collected
data from the PostgreSQL database by different criteria.
(I don't like PostgreSQL much due to the lack of decent tools like
phpmyadmin. Pgadmin3 and Phppgadmin don't give me the feeling that
I control the database. More the other way round. But PostgreSQL
has a nice 'inet' data type that allows for quick matches in tables
of IP addresses and networks.)

However the information in the (relational) database was stored in a
horribily artificial way. The SQL query is a 20-line monster with UNIONs
and LEFT JOINs and negations. It's lightning fast (0.5 seconds to search
over a 500 set consisting of complex rules) but neither the source code
nor the database is easy to handle any more. And I'd like to have more
flexibility in the kind of queries I run. So I'd like to trade the good
speed by some readability and a simpler - more object-oriented - data
structure.

I'm currently thinking of different ways to handle that but would like to
get some opinions about that:

(a) See what sqlalchemy can do for me to handle the object-relational
transformation and basically stay with PostgreSQL.
(b) Parse the input files into one large nested Python data structure.
Then write this structure to a file using "marshal" or "repr".
Then I have a very clean source code like
for rule in rules:
for src in sources:
if searched_for_src == src...
(c) ...?

What makes PostgreSQL less suited is the fact that CheckPoint rule bases
can contain several complex objects:
- Hosts (easy, they are just one IP address and can easily be compared)
- Networks (nearly as easy - just see if the IP is part of the network)
- Groups (slightly harder; can even be nested and contain other groups
and hosts or networks)
- IP ranges (10.0.0.50-10.5.25.100; not easy to parse either)

I would even like to allow the users more complex queries like multiple
search conditions. The query would be something like "show me all matching
firewall rules where 10.0.0.5 matches the source column and 192.168.42.1
matches the destination column OR any rule where the group 'internal hosts'
is mentioned in the destination column". It sounds like a database is the
right job. But somehow a database is also not flexible enough. And the
data is small enough (1 MB probably) that it can be read into memory.

What do you think would be the right tool for the job? Thanks for sharing
your thoughts.

Christoph
 

Bookmarks


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are Off
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On

What is Bytes?

We are a network of experts and professionals in IT and software development that help one another with answers to tough questions and share insights. Get the best answers to your questions from over network members.
Post your question now . . .
It's fast and it's free

Popular Articles