hello,
here is my trouble, i would to like to write a program which could
help me to detect sequence of consecutive words in list in a very
efficient way. (i need to do it upon large amount of text, and for now
i'm looking for a good start point)
i have many "sentences" of variables sizes, for example one could be :
sentence1 = ['w43', 'w12', 'w41', 'w85', 'w4', 'w74', 'w24', 'w1',
'w6', 'w41', 'w85', 'w7', 'w23', 'w43']
where wx can be seen as a word.
then i had sequence of words which i need to recognize, for example :
sequences = [['w41', 'w85', 'w4'], ['w41', 'w85', 'w7'],] etc.
and need something like span position where sequences are present in
sentence:
{
['w41', 'w85', 'w4']: [[2, 4], ],
['w41', 'w85', 'w7']: [[9, 11], ]
}
for now i'm doing it by using many nested 'for loop', and it is too
slow, as i search for each sequence one by one, and then restart from
beginning, and so on, i'm wondering if this can not be done by using
nested dictionnary or something like that, maybe is there an
improvement to imagine ?
(i'm wondering if i should not find a way to not re-start from
beginning when i had found ['w41', 'w85', ] for the second time and
thus just only look after last token 'w4' or 'w7')
also i think this is related to some kind of finite state automata,
but i do not know how to build automatically massive parallel FSM (i
can have many many sequences and huge amount of sentences)
i was also wondering if some genetics algorithm used to align sequence
of protein could not help me, i had once read something about that ,
but still can not where to start. also as far as i remember such
algorithms take into account that 'sequence' could not be consecutive,
what i need is consecutive.
maybe one of you could help ? give a start point or even an idea that
could help me to design this piece of code ? 4 1944
joh12005,
your problem sounds as it may be solvable with mxtexttools.
they claim to be very fast in analyzing text and splitting into python
tuples. I never fully understood how to deal with them, sorry.
so just google for it and try it.
David Merz wrote "text processing in python" and dealt with this tools.
Sun,
Harald
Joh <jo******@yahoo .fr> wrote: also i think this is related to some kind of finite state automata, but i do not know how to build automatically massive parallel FSM (i can have many many sequences and huge amount of sentences)
here's a naive but clean way to do it
Write a finite state machine class that scans the sentence for a single
sequence. Then instantiate one of them per sequence, and run the
following loop (pseudocode)
for word, index in enumerate sentence
for fsm in sequences
fsm.advance(wor d)
if fsm.state == END
results[fsm] += (index - fsm.sequence_le ngth, index)
fsm.state == START
A possible optimisation is to combine all your sequences into a single
trie; however, this will also require backtracking so I'm not sure how
much faster it'd be in practice.
martin
Joh wrote: here is my trouble, i would to like to write a program which could help me to detect sequence of consecutive words in list in a very efficient way.
These repeats are called "tandem arrays." Algorithms for quickly
finding them (and many other useful patterns) are given in
Gusfield, Dan. 1997. Algorithms on strings,
trees and sequences. Cambridge: Cambridge
University Press.
Gusfield has a web site http://wwwcsif.cs.ucdavis.edu/~gusfield/
with downloadable software. I don't recall whether his web site talks
about tandem arrays, but I'm sure you could google the term...
Mike Maxwell
Mike Maxwell wrote: Joh wrote:
here is my trouble, i would to like to write a program which could help me to detect sequence of consecutive words in list in a very efficient way.
These repeats are called "tandem arrays." Algorithms for quickly finding them (and many other useful patterns) are given in
Apologies, I just re-read your original post, and (if I understand what
you are looking for), it is not tandem arrays. Nevertheless, Gusfield's
book is on a whole host of text search problems, and may still give you
leads (or even answers).
Mike Maxwell This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: Jelle Feringa / EZCT Architecture & Design Researc |
last post by:
Could anyone recommend me a genetic algorithm package? So far I have found a
few, such as GAS, pyGP, Genetic, and of course scipy.ga
My problem is that most of the development of these packages seems to be
stalled, or that in scipy.ga's case, the module seems huge and somewhat
overly complicated. Could someone recommend me a module, which is still
being developed, somewhat more concurrent than the stuff I'm finding? Any
recommendations,...
|
by: John |
last post by:
I am trying to use the xml desinger in .net 1.1 to create
the follow schema. where USAddress is derived from Address
But when I generate the xml file, the .net xml editor says
"The active schema does not support the element 'street'"
The schema is from a msdn .net example.
What I am missing here?
Thanks
|
by: Bruno R. Dias |
last post by:
Perhaps it would be interesting to program a virtual machine simulating
an ancient computer (such as the pdp-7). Then, it would be rather
interesting to code for it (porting gcc to it maybe?). I think it would
be fun to play with the long-forgotten art of coding in machine language.
And what about a fictional computer, such as one that works on an
entirely different way (such as a non-binary computer)?
It wouldn't be very useful, but...
|
by: Xiao Jianfeng |
last post by:
Hi all,
I am looking for a genetic algorithms package for Python.
I have googled the web before posting and found some links. The link of
pygene(http://www.freenet.org.nz/python/pygene) cannot be opened.
I also tried the recipe on ASPN, but it is too simple for my
application, and the ga model in SciPy, which is in testing in the
"sandbox".
|
by: aaragon |
last post by:
Hello everybody,
I appreciate your taking the time to take a look at this example. I
need some help to start the design of an application. To that purpose
I'm using policy-based design. The idea is to have a Class that stores
elements of Class2 in different ways (arrays in stack memory, arrays in
heap memory, using the std::vector and so on). I would like the user
to customize the creation of a class with Class2 and StoragePolicy like...
| |
by: Anders B |
last post by:
I want to make a program that reads the content of a LUA array save file..
More precicely a save file from a World of Warcraft plugin called
CharacterProfiler, which dumps alot of information about your characters
into that save file.
Anyhow, I want to extract a couple of lines of it and save it into a
database and I need help on figuring out a good way of reading the file.
The problem is that the file can look pretty different depending...
|
by: =?Utf-8?B?TWlndWVsIElzaWRvcm8=?= |
last post by:
Hi,
I've built an ASP.NET 1.1 web service and an ASP.NET 1.1 application that
calls it
is throwing the following exception:
System.Web.Services.Protocols.SoapException: Server did not recognize the
value of HTTP Header SOAPAction: http://ipm.sitefactory.com/Authenticate. at
System.Web.Services.Protocols.Soap11ServerProtocolHelper.RouteRequest() at
System.Web.Services.Protocols.SoapServerProtocol.Initialize() at
|
by: DumRat |
last post by:
Hi,
I wanted to create a genetic algorithm for this question :
Suppose you are out shopping. You have to buy things from different n number of stores. And have to bring home the goods required from every store. There is an amount(specified in kg) of goods that you have to buy at each store and once at a store, you have to buy that total amount. Of course, carrying all this weight around is going to cost you energy. So, you want...
|
by: sessmurda |
last post by:
I am writing a fairly simple script that is supposed to print out lines from a txt file and group them based on where I want them. The general format of the files I am using the script on is as follows:
662376 |GCGG | | |
662375 |CGCC | | |
662374 |GCGG | | |
662373 |CATC | | |
662371 |TCCC | | |
662369 |CACCC| | |
662367 |TCTTT| | |
662365 |GCGGG| | |
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look !
Part I. Meaning of...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it.
First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
| |
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed.
This is as boiled down as I can make it.
Here is my compilation command:
g++-12 -std=c++20 -Wnarrowing bit_field.cpp
Here is the code in...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own....
Now, this would greatly impact the work of software developers. The idea...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one.
At the time of converting from word file to html my equations which are in the word document file was convert into image.
Globals.ThisAddIn.Application.ActiveDocument.Select();...
|
by: adsilva |
last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
|
by: muto222 |
last post by:
How can i add a mobile payment intergratation into php mysql website.
| |
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...
| | |