Index on TEXT versus CHAR(32)... fast exact TEXT matching

Jon Lapham

I have a table that stores TEXT information. I need query this table to
find *exact* matches to the TEXT... no regular expressions, no LIKE
queries, etc. The TEXT could be from 1 to 10000+ characters in length,
quite variable. If it matters, the TEXT may contain UNICODE characters...

Example:
CREATE TABLE a (id SERIAL, thetext TEXT);
SELECT id FROM a WHERE thetext='Some other text';

One way I thought to optimize this process would be to store an MD5 hash
of the "thetext" column and query on that:

CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32));
SELECT id FROM a WHERE thetext_m5d=md5('Some other text');

Now, obviously I would want to build an INDEX on either "thetext" or
"thetext_md5", depending on which way I decide to make the table.

My question is, what is the absolute fastest way to find the exact match
to a TEXT column? Any amount of pre-processing is fine (such as
calculating the MD5's of all the TEXT tuples), but the query must be
extremely fast.

Has anyone compared (theoretical or practical) the performance of
querying a TEXT-based INDEX versus a CHAR(32)-based INDEX? Is my MD5
idea a waste of time? Is there something better than MD5? Would it be
better to store the "fingerprint" of the TEXT as an integer somehow, so
that the INDEX could be based on a INT* column?

Thanks for any help!

-Jon

--
-**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*---
Jon Lapham <la****@jandr.org> Rio de Janeiro, Brasil
Personal: http://www.jandr.org/
***-*--*----*-------*------------*--------------------*---------------
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #1

Subscribe Post Reply

4146

Pierre-Frédéric Caillaud

> CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32));

SELECT id FROM a WHERE thetext_m5d=md5('Some other text');

From your explanations :
- you need an exact match text->id
- you already know the text

Then, why bother with the TEXT column ?
I have no idea on the performance of pg text indexes, but I do know that
a table which fits in RAM, or an index which fits in RAM, is faster than
when it does not.

Try this :

create table my_searches (
id serial primary key,
text_md5 char[16] (or even a binary string)
);

Then create a table my_text (id,full_text) with appropriate foreign key
(to get a 1<->1 relationship between your tables) because you may still
want to remember the text. Index it on id of course.

My point is that your search table no longer has the big text field, it
has only constant size fields, (thus constant rowsize) and it is much
smaller. The index will also be smaller and maybe fit in RAM. You then :

SELECT id FROM my_text WHERE id in (SELECT id FROM my_searches WHERE
text_md5 = (your md5)) AND full_text=(your text);

Unless a MD5 collision occurs, which should be rare, the subquery will
return 1 result and the first SELECT will just do an index lookup on ID.

Now, ok, I'm stupid, I reimplemented the Hash index...

I'd be interested in your benchmarks of your real data with :
- table with full text + id, with btree index on text
- table with full text + id, with hash index on text
- the scheme I just came up with

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #2

NTPT

what about to use a CRC32 checksum of the text, computed by client
application an then make index of crc32 data ?

ie

add column crc int4,
add column md5 varchar(255 )
create index blabla on mytable (crc)
or even create index blabla2 on mytable (crc,md5)

and query like ......... where crc='crc32 of your searched text' AND
md5='md5 hash'

i am not shure, but this should work pretty fast.
----- Original Message -----
From: "Jon Lapham" <la****@jandr.org>
To: <pg***********@postgresql.org>
Sent: Saturday, September 04, 2004 3:04 PM
Subject: [GENERAL] Index on TEXT versus CHAR(32)... fast exact TEXT matching

I have a table that stores TEXT information. I need query this table to
find *exact* matches to the TEXT... no regular expressions, no LIKE
queries, etc. The TEXT could be from 1 to 10000+ characters in length,
quite variable. If it matters, the TEXT may contain UNICODE characters...

Example:
CREATE TABLE a (id SERIAL, thetext TEXT);
SELECT id FROM a WHERE thetext='Some other text';

One way I thought to optimize this process would be to store an MD5 hash
of the "thetext" column and query on that:

CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32));
SELECT id FROM a WHERE thetext_m5d=md5('Some other text');

Now, obviously I would want to build an INDEX on either "thetext" or
"thetext_md5", depending on which way I decide to make the table.

My question is, what is the absolute fastest way to find the exact match
to a TEXT column? Any amount of pre-processing is fine (such as
calculating the MD5's of all the TEXT tuples), but the query must be
extremely fast.

Has anyone compared (theoretical or practical) the performance of
querying a TEXT-based INDEX versus a CHAR(32)-based INDEX? Is my MD5
idea a waste of time? Is there something better than MD5? Would it be
better to store the "fingerprint" of the TEXT as an integer somehow, so
that the INDEX could be based on a INT* column?

Thanks for any help!

-Jon

--
-**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*---
Jon Lapham <la****@jandr.org> Rio de Janeiro, Brasil
Personal: http://www.jandr.org/
***-*--*----*-------*------------*--------------------*---------------
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #3

Similar topics

How to use index

by: Ken | last post by:

Dear all, I have a beginner of oracle and recently I have a prob with using indexes. I have a table with 5 columns, with no primary key set in the table. An index has created for column 1. I...

Oracle Database

What is the fastest way to count lines in a text file?

by: mesterak | last post by:

I want to very quickly count the number of lines in text files without having to read each line and increment a counter. I am working in VB.NET and C#. Does anyone have a very fast example on how...

.NET Framework

Joining 2 char to 1 short

by: Steffen Loringer | last post by:

Hi, I'm using the following function to join 2 char (byte) into one short on a 32 bit X86 platform: unsigned short joinUnsigShort(unsigned char a,unsigned char b) { unsigned short val = 0;...

C / C++

std::string to char[] with a twist

by: ruffiano | last post by:

Hello, I need your expertise to perform a conversion. I need to convert a std::string to a char. However, in the conversion, I need to get rid of the '/0' character. The initial string...

C / C++

Need help with strings and coloring text fast, please.

by: Andy011 | last post by:

Hi, I wanna know how to avoid the program to scan only the first letter of a string... I know it sounds real n00bish, but if you could post a simple example (SIMPLE, not like the help menu from c++)...

C / C++

122

"index" method only for mutable sequences??

by: C.L. | last post by:

I was looking for a function or method that would return the index to the first matching element in a list. Coming from a C++ STL background, I thought it might be called "find". My first stop was...

Python

Grid View, programmatically set gridview textbox property readonly=false

by: johnlim20088 | last post by:

Hi Someone please help me string typ; if (e.Row.RowType == DataControlRowType.DataRow) { typ = e.Row.Cells.Text;

.NET Framework

String.Replace method not usable when knowing index to replace

by: Lonifasiko | last post by:

Hi, Just want to replace character at index 1 of a string with another character. Just want to replace character at that position. I thought Replace method would be overloaded with an index...

C# / C Sharp

How to create an index fast?

by: hedonist123 | last post by:

Hi, I have a table with close to 30 lakh records. Now I wish to create an index on one of the columns. Earlier when I had tried to create an index on another column on the same table, the query...

DB2 Database

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General