Information, code, or reading about machine-based textual analysis and classification?

I'm hoping someone here can point the way toward a fairly specialized topic.
I have large amounts of content that need to be classified. Irrelevant or
"uninteresting" (by our criterion) articles need to be disposed of.
"Interesting" articles should be tagged with a number of points of metadata.

As much as possible, I would like machines to do this work. Fairly dumb
ways of doing this might include using our existing databases/metadata as
keyword collections and classifying based on brute-force scans and matches
against our content (e.g. strings "Department of Laboratory Medicine",
"Dr. James Fine"). Smarter ways may have been suggested by some of the
presentations at the recent "Google Developer Day" in Mountain View:
programmatic analysis of seed texts led to mechanisms of analysis that
seemed much more efficient than raw text scanning.
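
To make the brute-force idea concrete, here is a minimal sketch in Python (my choice for illustration; nothing above ties the task to a language). It assumes the phrases have already been pulled out of the existing databases into a plain list. The names `build_matcher`, `find_keywords` and `KNOWN_PHRASES` are hypothetical, and the sample article text is made up.

```python
import re

def build_matcher(phrases):
    """Compile all known phrases into one case-insensitive pattern."""
    # Longest phrases first, so a long name wins over a shorter one it contains.
    ordered = sorted(set(phrases), key=len, reverse=True)
    pattern = "|".join(re.escape(p) for p in ordered)
    return re.compile(pattern, re.IGNORECASE)

def find_keywords(matcher, text):
    """Return the distinct phrases found in text, lower-cased and sorted."""
    return sorted({m.group(0).lower() for m in matcher.finditer(text)})

# Hypothetical keyword list; in practice it would be loaded from the metadata tables.
KNOWN_PHRASES = ["Department of Laboratory Medicine", "Dr. James Fine"]
matcher = build_matcher(KNOWN_PHRASES)

article = ("A new assay was announced by the Department of Laboratory Medicine. "
           "The work was led by Dr. James Fine.")
hits = find_keywords(matcher, article)
```

An article with no hits would be a candidate for disposal, and the hits themselves could serve as metadata tags. A single compiled pattern keeps this to one pass over each article, though with many thousands of phrases a dedicated multi-pattern matcher would probably scale better.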

I am speculating based on no real knowledge, but I would imagine it would be
possible to develop some kind of "relevance index" for an item as compared
to an existing body of text, and keep or dump based on a threshold. More
interestingly, maybe I have classified a thousand articles as, say, "UW
biomedical research," and there is an algorithmic means by which we could
assess the "UW biomedical research-ness" of an unknown text. That would be
very useful.
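
One simple way such a "relevance index" could be computed, sketched here in Python using only the standard library: build a word-count profile from the already classified articles, then score an unknown text by its cosine similarity to that profile. This is only an illustration of the idea, not a recommendation of a specific method. The function names and the tiny sample texts are hypothetical, and any keep/dump threshold would have to be tuned on real data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_profile(documents):
    """Sum word counts over all articles already classified into one category."""
    profile = Counter()
    for doc in documents:
        profile.update(tokenize(doc))
    return profile

def relevance(profile, text):
    """Cosine similarity between the text's word counts and the profile (0.0 to 1.0)."""
    counts = Counter(tokenize(text))
    dot = sum(n * profile[word] for word, n in counts.items())
    norm = (math.sqrt(sum(n * n for n in counts.values()))
            * math.sqrt(sum(n * n for n in profile.values())))
    return dot / norm if norm else 0.0

# Made-up stand-ins for the thousand hand-classified articles.
labeled = [
    "Researchers studied blood samples in the clinical laboratory.",
    "The laboratory medicine group published research on blood chemistry.",
]
profile = build_profile(labeled)

on_topic = relevance(profile, "New laboratory research on blood samples")
off_topic = relevance(profile, "Local football team wins the championship game")
```

Raw counts let common words such as "the" inflate every score, so a realistic version would likely drop stop words or weight terms by TF-IDF. A trained text classifier such as naive Bayes is the usual next step when there are labeled examples of both interesting and uninteresting articles.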

Are there resources or readings I can be looking at? Are there any
pre-existing libraries or frameworks or tools that could ease this task? The
content lives in MS SQL Server 2005 or can be placed in it; Index Server is
installed on my servers; maybe there are things within these tools that can
help.

Thanks in advance for any leads you can offer.

-KF
Jun 20 '07 #1