By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
446,275 Members | 1,990 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 446,275 IT Pros & Developers. It's quick & easy.

Sequence of characters problem

P: 1
I have a sequence of characters. The characters are ordered in chronological order. I am looking for an algorithm to group the sequence and remove errors in the data. I am having an hard time to explain in words/maths what the requirement of fixing the sequence is but something like "the outcome is to group as many characters as possible in a chain of constant letters" and minimum sequence length should be a setting. Below I am trying to visualize by some examples what I mean (3 chars minimum):

AAACCAACA => AAAAAAAAA

ACACAAAAA => AAAAAAAAA

AYYYYYYYA => YYYYYYYYY

YYYYAAYYY => YYYYYYYYY

AYAYAYYYY => YYYYYYYYY

And for longer sequences it becomes a little more tricky

AYAYAYAYAAAAAAAAAAYAYYYYYYYYYYYYYY => AAAAAAAAAAAAAAAAAAYYYYYYYYYYYYYYYY

HTHTTHHHTTHHH => TTTTTHHHHHHHH

TTOAOAOOAATTA => OOOOOOOOAAAAA

This is bit tricky because even thought the 3 O's aren't directly next to each other, they are still the most probable uninterrupted sequence of characters.

Does anyone know of any algorithm (Machine learning?) or similar (term of this problem, problem name) that can do this type of error-code correction?

Thank you in advance
1 Week Ago #1
Share this Question
Share on Google+
1 Reply


Rabbit
Expert Mod 10K+
P: 12,392
If you want to write a rules based engine, then you'll need to come up with more precise rules. For example, you will need to define why AYAYAYYYY maps to YYYYYYYYY and not AAAAAYYYY.

If you want to write a fuzzy matching algorithm, then there's plenty to choose from, n-gram matching and levenshtein distance comes to mind.

If you want to use machine learning, as in a neural network, then you will need tens of thousands, if not more, of manually verified examples to feed the network for training.

Or, don't fix the data, instead, fix the source of the error. Go back to the system before you receive the data and fix the issue that is sending you incorrect data.
1 Week Ago #2

Post your reply

Sign in to post your reply or Sign up for a free account.