[RDF] keywords idea
Jonas Liljegren
jonas@paranormal.se
29 May 2001 09:42:33 +0200
--=-=-=
Something to comment on...
--=-=-=
From: Tobias Brox <tobiasb@suptra.org>
Newsgroups: comp.infosystems.search,comp.databases,comp.programming
Subject: Searching, indexing and scoring; call for comments
Followup-To: comp.databases,comp.infosystems.search
Date: 3 May 2001 11:04:22 +0200
Organization: Bestumvn 35, annen etage
Message-ID: <keyworddesign1@tobiasb.suptra.org>
Summary: Some thoughts about how a general purpose system for searching, indexing and scoring should look
Keywords: score,scoring,scoring system,fuzzy,fuzzy logic,fuzzy ai,indexing,search engine,data mining,jukebox,mail sorting,file organizer
Follow-ups go to comp.databases ... unless somebody has a better
suggestion?
I'd like to discuss the design of an indexing/scoring software module I
eventually want to write (see also news:kbswmodule1@tobiasb.suptra.org).
The system should be able to run stand-alone, e.g. like an internet
search engine, but it should also be possible to incorporate it into
applications such as mail readers, jukebox systems, etc.
The main idea is that some human being wants to find one or
more objects - it might be music, images, software documentation ...
just anything. The person might search for something specific, e.g. a
mail he got in the mailbox some weeks ago - or it might be something
unspecific, e.g. an interesting film or some good music to listen to.
First I want to discuss how this system should ideally be
designed to make it possible to find exactly what somebody might
be searching for. I've been thinking a bit about how to make a solution
that can actually work and scale, and also about how to present it to
the user - but let's postpone discussing details.
I've been thinking very hard - so I really want some feedback. Am I a
genius, or should I rather report to the nearest hospital for mentally
ill people? Does anybody understand my thoughts at all? Or is this
article just too long and boring to read?
- Simple system with hierarchical keywords
The system should contain a lot of fixed keywords in the database.
Those keywords are administered by some intelligent person/group.
Keywords are assigned to relevant articles/URLs/whatever either
automatically, semi-automatically or manually.
The keywords are hierarchically sorted, making it easy for a user to
navigate through them, select the appropriate ones, and find what he is
looking for.
The keywords may describe not only the content, but anything else
as well - language, content type, etc.
Some administration is needed to keep some quality and consistency
in the keywords, and to keep the hierarchy in order.
A person who wants to do a search can select some keywords and get
some results. If there are too many results, he can constrain the
search by adding more keywords, or he can loosen it by deleting some
keywords from the search.
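A minimal sketch of such a search, assuming keywords are stored as
slash-separated paths and each object simply carries a set of them. The
names ASSIGNMENTS, matches and search, and all the sample data, are made
up for illustration; nothing here is an existing implementation.

```python
# Hypothetical sample data: object id -> set of assigned keyword paths.
ASSIGNMENTS = {
    "doc1": {"Music/Blues", "Language/English"},
    "doc2": {"Music/Blues", "Language/Norwegian"},
    "doc3": {"Music/Jazz", "Language/English"},
}


def matches(assigned, wanted):
    """True if an assigned keyword equals `wanted` or lies below it in the hierarchy."""
    return any(k == wanted or k.startswith(wanted + "/") for k in assigned)


def search(selected):
    """Return the ids of the objects that match every selected keyword."""
    return sorted(
        obj for obj, assigned in ASSIGNMENTS.items()
        if all(matches(assigned, kw) for kw in selected)
    )


broad = search(["Music"])                       # loose search, one keyword
narrow = search(["Music", "Language/English"])  # constrained by a second keyword
```

Adding a keyword to the list can only shrink the result, and removing one
can only grow it, which is the constrain/loosen behaviour described above.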
- Networked keywords
Often it will be uncertain where in the hierarchy to put keywords - and
where to search for them. One keyword might often fall into more than
one category, so every keyword can have several parents.
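One way to picture this is a graph in which each keyword points to a set
of parents instead of a single one. The sketch below assumes such a
mapping (PARENTS and the keyword names are invented for the example) and
walks it upward to find everything a keyword belongs to.

```python
# Hypothetical keyword graph: keyword -> set of parent keywords.
PARENTS = {
    "Blues Rock": {"Blues", "Rock"},
    "Blues": {"Music"},
    "Rock": {"Music"},
    "Music": set(),
}


def ancestors(keyword):
    """Collect every keyword reachable by following parent links upward."""
    seen = set()
    stack = [keyword]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:  # also guards against accidental cycles
                seen.add(parent)
                stack.append(parent)
    return seen


blues_rock_categories = ancestors("Blues Rock")
```

A search for "Rock" or for "Blues" would then both be able to reach
objects tagged "Blues Rock".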
- Scoring/voting
Sometimes it's a bit fuzzy whether a keyword fits or not. E.g., some music
is being played in a pub, and one person says: "This sounds like
blues to me". The second person says "Blues? Not at all!". The third
has never considered it to be blues, but he might agree a bit.
This can be solved through scoring. I think we should have a voting
system where the users can give an integer vote (e.g. between -16 and 16)
where a negative value means "I disagree" and a positive value means "I
agree". Then either the sum of the votes or the average vote can be
used to determine whether it's blues or not.
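As a rough illustration of the pub example, assuming votes are plain
integers in the range -16 to 16: the function name tally and the three
vote values are invented, and treating "sum above zero" as the threshold
is only one possible choice.

```python
VOTE_MIN, VOTE_MAX = -16, 16  # range taken from the example in the text


def tally(votes):
    """Return (sum, average) of a list of integer votes; (0, 0.0) for no votes."""
    for v in votes:
        if not VOTE_MIN <= v <= VOTE_MAX:
            raise ValueError(f"vote {v} outside [{VOTE_MIN}, {VOTE_MAX}]")
    total = sum(votes)
    average = total / len(votes) if votes else 0.0
    return total, average


# Three guests: one agrees strongly, one disagrees strongly, one agrees a bit.
total, average = tally([12, -16, 3])
is_blues = total > 0  # assumed threshold: a positive sum means "yes"
```

The sum rewards keywords that many people have voted on, while the
average does not; which of the two fits better probably depends on the
application.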
- Integer keyword assignment
Sometimes it makes sense to search for the right number rather than just
the right keywords. E.g., some radio host might want to search for a
piece of music that is exactly 4'15" long to fill up the last gap in his
program before the news. Or somebody might want to filter out all
emails that are bigger than 30k. Or they might want to buy some car
that costs less than 5k$. I guess we can keep one integer with every
keyword assignment, where the number can be either a score or a vote.
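A small sketch of numeric searches over such assignments, assuming each
(object, keyword) pair stores one integer. The keyword names, the units
(seconds, bytes) and the reading of "30k" as 30,000 bytes are assumptions
made for the example.

```python
# Hypothetical assignments: (object, keyword) -> integer value.
VALUES = {
    ("song1", "Length/Seconds"): 255,   # 4'15"
    ("song2", "Length/Seconds"): 198,
    ("mail1", "Size/Bytes"): 45_000,
    ("mail2", "Size/Bytes"): 2_300,
}


def select(keyword, predicate):
    """Return the objects whose integer value for `keyword` satisfies `predicate`."""
    return sorted(
        obj for (obj, kw), value in VALUES.items()
        if kw == keyword and predicate(value)
    )


gap_fillers = select("Length/Seconds", lambda s: s == 4 * 60 + 15)
big_mails = select("Size/Bytes", lambda b: b > 30_000)
```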
- Voting on the keywords and keyword links themselves
Instead of having some administration that checks that the system
only contains sane keywords, and that the links between the keywords are
sane, we can just as well let the users themselves handle it. Let
people vote on new keywords and new keyword links. The links and
keywords that get a negative score might be wiped from the system
after a while - the same with keywords and links that nobody uses.
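The clean-up step could look roughly like this, assuming the system keeps
a summed vote and a usage count per keyword (or per link). Both mappings
and the function name prune are hypothetical, and the "after a while"
grace period is left out to keep the sketch short.

```python
def prune(scores, usage):
    """Keep only keywords (or links) with a non-negative score that are still in use.

    `scores` maps a keyword to its summed votes; `usage` maps it to the
    number of objects currently carrying it. Both structures are assumptions.
    """
    return {k for k, score in scores.items() if score >= 0 and usage.get(k, 0) > 0}


kept = prune(
    scores={"Music/Blues": 7, "asdfgh": -5, "Music/Polka": 2},
    usage={"Music/Blues": 40, "asdfgh": 1, "Music/Polka": 0},
)
```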
- Combination of keywords
Let's say that "Persons/By Name/Tobias Brox" is one keyword in the
system, while "Metainformation/Author" is another keyword. How do we find
all articles where Tobias Brox is the author through this system? I can
see two alternatives: either we have a keyword "Tobias Brox as Author"
that has both "Metainformation/Author" and "Persons/By Name/Tobias
Brox" as parents, or we attach a reference to "Tobias Brox" when
assigning the "Metainformation/Author" keyword to the article.
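The second alternative might be sketched as follows, assuming an
assignment is a small record with an optional reference to another
keyword. The Assignment type, its field names and the sample articles are
invented for illustration.

```python
from typing import NamedTuple, Optional


class Assignment(NamedTuple):
    obj: str
    keyword: str
    ref: Optional[str] = None  # optional reference to another keyword


# Hypothetical assignments using the "reference" alternative.
ASSIGNED = [
    Assignment("article1", "Metainformation/Author", "Persons/By Name/Tobias Brox"),
    Assignment("article2", "Metainformation/Author", "Persons/By Name/Someone Else"),
    Assignment("article3", "Persons/By Name/Tobias Brox"),  # only mentions him
]


def find(keyword, ref):
    """Return the objects that carry `keyword` with a reference to `ref`."""
    return [a.obj for a in ASSIGNED if a.keyword == keyword and a.ref == ref]


by_tobias = find("Metainformation/Author", "Persons/By Name/Tobias Brox")
```

The first alternative needs no extra field, but it creates one combined
keyword per person and role, so the keyword list grows much faster.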
- Rules
Sometimes it might be necessary to do complex searches, like "those
keywords, but not this one or this other one". One alternative might be
to let the user create new keywords defined by a rule. I once made a
weird system for sorting my mail, where all keywords (they were flat; a
list of ~100 keywords) were automatically assigned and scored through
rules; most of them were rules about headers, regexps on the message
body, etc., but in the end I wanted to select mail by applying only one
keyword.
I made a system where it was possible to apply a few rules based on
two keywords A and B:
"If A, then B"
"If not A, then B"
"If A, then not B"
"If not A, then not B"
"B = A xor B"
By specifying the order of those rules, it was actually possible to
specify any kind of logic, and - as a result - fetch out any collection
of mail I wanted by choosing only one keyword.
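To make the ordering idea concrete, here is one possible reading of the
five rule types, assuming each mail carries a plain set of keywords and a
rule only ever changes whether B is present. The rule names, the RULES
list and the keyword "wanted" are invented; the original mail sorter may
well have worked differently.

```python
def apply_rule(kind, a, b, keywords):
    """Apply one rule to a keyword set and return the resulting set."""
    has_a, has_b = a in keywords, b in keywords
    if kind == "if_a_then_b":
        new_b = has_b or has_a
    elif kind == "if_not_a_then_b":
        new_b = has_b or not has_a
    elif kind == "if_a_then_not_b":
        new_b = has_b and not has_a
    elif kind == "if_not_a_then_not_b":
        new_b = has_b and has_a
    elif kind == "xor":  # "B = A xor B"
        new_b = has_a != has_b
    else:
        raise ValueError(f"unknown rule kind: {kind}")
    return keywords | {b} if new_b else keywords - {b}


# Hypothetical rule list for "work mail, but not spam or newsletters".
RULES = [
    ("if_a_then_b", "work", "wanted"),
    ("if_a_then_not_b", "spam", "wanted"),
    ("if_a_then_not_b", "newsletter", "wanted"),
]


def classify(keywords):
    """Run all rules in order; later rules see the results of earlier ones."""
    for kind, a, b in RULES:
        keywords = apply_rule(kind, a, b, keywords)
    return keywords


plain_work = classify({"work"})
work_spam = classify({"work", "spam"})
```

Selecting the single keyword "wanted" afterwards gives the combined
result. Reordering the list changes the outcome, since a later "then B"
rule would re-add what an earlier "then not B" rule removed.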
- Distribution
It would be nice to have some standard way of exchanging and sharing
smart keywords between different systems, and at the same time enforce
that the truly local keywords stay local.
--
Tobias Brox - freelancer for hire!
Programming, system administration, etc
+47 98660706 / tobiasb@suptra.org
--=-=-=
--
/ Jonas - http://jonas.liljegren.org/myself/en/index.html
--=-=-=--