Relational Desktop Search

Intro

Ok, this has been all over wish lists for dozens of different 'Desktop Search' programs for the last couple of years, and while a million bright ideas have flown every which way, no one ever seems to be able to find much beyond a cool proof of concept, or a neat screencast. Tracker is constantly talking about the possibilities of other applications using it as a centralized metadata store, and that level of integration would be great, however I think there is a much cooler, much juicer, and much more powerful possibility out their.

One of the Beagle projects during the 2006 Google Summer of Code was to work on metadata browsing and linking. While this project touched on the potential of an intelligent and powerful desktop engine, it unfortunately never made it far enough to unlock it.

By relational desktop search, I don't mean a relational database, yes, it would be wonderful if our desktop systems had been built in such a way, but in the end, desktop data isn't stored in such a strict manner. However, relationships exist all over the desktop, and in may cases can 'chain' together to reveal a wealth of information about the initial source. While much of this data is already stored and/or available there are a few key relationships which are particularly hard to determine (mostly because of the difficulty with interacting with the wide variety of desktop applications)  for the sake of this discussion we shall assume uncovering those links could/can be done if a concrete purpose for them is discovered.

Overview

Up until this point I have been very vague about what exactly I am trying to describe. What I am talking about is a desktop search which is able to build relevancy in much the same way that the current web searches do, but extracting even more data from each link,and understanding the dozens of relationships to each item. On the web there are (grossly oversimplifying) 2 types of relationships.
The desktop can exhibit _hundreds_ of complex and varying relationships with each entity. However, these relationships are not step deep, lets take a PDF document for example.

Level 1
(Subject/Title/Keywords are going to be considered 'content' at the moment, although they could provide more insight with some better language tools)

These are standard relationships, and already tracked/stored and in most cases searchable, while often useful when the user is working on an advanced search, we want to utilize this power more inherently. Lets look at some of the relationships we could extract from the above data (assuming we have similar data for _most_ queryable entities)

 
 I'll stop here, as I think the point I'm trying to make is clear enough, the recursive relationship building could descend several levels, and give us a clear sense of 'topic' that relates to a document. This is particularly useful for instances like notetaking or project collaboration. There exists a relationship between several documents (they are all part of the same project) but their may be almost _no_ keyword relationship, but there is obviously a linked importance. I would expect that if these 4 documents are:
  * Pandas and their bamboo
  * Elephants and their poo
  * Tigers and Winnie
  * Alvan and the Chipmunks
  a query for Panda's would probably not find any of the other 3 documents on a first page of results, but they would be their, a page or 2 back. Also, we would get the users you were collaborating with on these documents (again, not high priority results, but in the result set).
  Also, say the user was browsing wikipedia at the time when the panda document was written, and one of the pages read was on pandas. That page would be returned, but behind the panda document, as the panda document has shown up in the related documents for the other 3 documents as well. At first this seems like an unusual or odd behavior, double counting a recursive relationship? However, because the panda document was related to more documents, which were related to documents about pandas, it is more likely to contain information _relevant to the user_ with regard to pandas. Yes, the wikipedia page is more keyword laden with the word 'Panda', and probably has more information about pandas, but it is obvious from the number of related documents (and their related authors) that the user has put more time and effort into the pandas document.
 
  It is important to also note that in the above scenario, the panda wikipedia entry would also have a relationship with the panda document, and the other 3 as well, but each level of recursion it takes to get a relationship _dramatically_ reduces its weight, so while the Panda document may have received 10 points for each document relating to it, the wikipedia page would have only received 1 for each.

Why Beagle?


I think that Beagle is uniquely situated as a desktop search engine to play host to some of the first _real working_ implementations of such a technology. While other systems have focused on the relational nature of data and tried to build first class objects to represent types based upon common desktop objects, Beagle has focused purely on getting and indexing the data, and that is what is needed for this to work. Complete and robust metadata about _lots_ of different elements of the user desktop, and a super-speedy query system for allowing the dozens of queries (against the indexes) needed to complete 1 'visible' query. In addition, Beagle is built around Lucene, a FullText indexing engine, the hardest part of this process is not recursing down the metadata of a 1st level object and firing off the related queries, its ranking results, and evaluating similarity. Specifically, extracting 'keywords' from a document, or its text is a big component of allowing the relationships to be content sensitive/relevant.

Now Beagle isn't 100% ready for this, there are still some major steps that would need to be taken before the Beagle engine could host such queries, the biggest of which is building a consistent and coherent way to pass in 2 indexed URI's and get their 1-step content relevance. ie. some sane way to evaluate 2 lucene documents full-text content. I know lucene has several mechanisms/means for doing this, its just a matter of integrating those into Beagle's higher-level system. A few smaller todo/wish list items:

Some Math


Not my strong point, but I would try to represent a items relevancy as :
(in latex that's: R(entity) = \sum_{a}^{b} { log_i (  \frac{qrv}{rqr} ) } )
where qrv = The Queries Recursive Value (sum)
rqr = Recursive Query Requests
R(entity) = Relevance of entity
b = level to recurse to (probably 4 or 5)


While exact values need to be determined for different relationships, heres a simple chart of an example



Ok, first, _all points from a relationship count_ even if there are _no_ terms in common (not just in the top 25). My thought on this is that if a user has 800 documents by a single author, or downloaded 20 documents from one site, everything from that site should have a _slight_ elevated importance. Now something like the E-mail, which served as the source of the document, is (by itself) extremely relevant, hence its high starting score, however, that number will still filter up to Pandas, as we do not want sources to end up outranking the more recent target.

The greatest issue is balancing content with time, my inclination as of this moment is to support a date-based multiplier on all final relevancy scores, based upon a file halflife of a week or so. However, this component of the equation still needs some thought, or maybe just some tweaking.


score(q, d) = Xt in q
tf (t in d)·idf(t)·boost(t.f ield in d)·lengthNorm(t.f ield in d)
(1)
where:
– tf(t in d) - Term Frequency - The number of times the term t appears in the
current document d being scored. Documents that have more occurrences of
a given term receive a higher score.
– idf(t) - Inverse Document Frequency - One divided by the number of documents
in which the term t appears. This means rarer terms give higher
contribution to the total score.
– boost(t.field in d) - The boost, specified in the query by the user, that should
be applied to this term. A boost over 1.0 will increase the importance of this
term; a boost under 1.0 will decrease its importance. A boost of 1.0 (the
default boost) has no effect.
– lengthNorm(t.field in d) - The factor to apply to account for differing lengths
in the fields that are being searched. Typically longer fields return a smaller
value. This means matches against shorter fields receive a higher score than
matches against longer fields.

Implementation

Ok, to realistically lets talk about how I might go about implementing not just some defined proof of concept, but a real relationship evaluation tool built upon the Beagle system. Usage of the MoreLikeThis class would be inevitable. My thought would be to find the 25 most common terms in a document set that matches a relationship subset. ie.

list l = List
foreach Document d in hit.alldocumentswithsameauthor
   l.add (build list of top 25 term's in d)

find most similar documents based on terms, and continue next recurse.

The key is a system which not only allows the 'root' of our recursive trees to gain relevance based upon its children, but to make sure all the children of the root are considered part of the return results. So in the following scenario
Hit 1 must contain relevancy boosts from Hits 3-6, and all hits must be in the result set, and in numeric order.

While a complete production implementation will need to take place close to the Lucene core for performance reasons, in order to asses the accuracy impact of this system, I will implement it as a simple Beagle Client, using the client API. While performance will be quite poor, this initial test is not to determine performance, but to build a complete list of what Beagle needs in place before something like this is a possibility.

NOTE! Just a skeleton check in, and some fiddling to see how well mono would handle listener loads, not much in terms of something usable yet!

You can see info about the bzr branch this is working in.
or just check it out at
bzr branch http://bazaar.launchpad.net/~kkubasik/beagle/RelationalQueryPOC

Common Tracked Data/Relationships

If any of the following are found, query the rest of them for relationships. This is the Author/Creator/Contributor relationship. (Eventually we should plan to differentiate weight based on more specific criteria. ie. a dc:contributor's documents weigh less than a dc:author match

"fixme:FileAs"
"fixme:GivenName"
"fixme:FullName"
"fixme:Name"
"fixme:Nickname"
"fixme:ImJabber"
"fixme:ImAim"
"fixme:ImIcq"
"fixme:ImYahoo"
"fixme:ImGroupWise"
"fixme:ImMSN"
"dc:creator"
"dc:author"

"fixme:identity"
"fixme:gotFrom"
"fixme:from"
"fixme:from_sanitized"
"fixme:from_name"
"fixme:from_address"
"fixme:to"
"fixme:sentTo"
"fixme:to_sanitized"
"fixme:cc"
"fixme:cc_sanitized"
"fixme:bcc"
"fixme:bcc_sanitized"
"fixme:to_name"
"fixme:cc_name"
"fixme:bcc_name"
"fixme:to_address"
"fixme:cc_address"
"fixme:bcc_address"
"parent:fixme:from_name"
"parent:fixme:to_name"
"parent:fixme:cc_name"
"parent:fixme:from_address"
"parent:fixme:to_address"
"parent:fixme:cc_address"
"dc:contributor"
"fixme:attendee"



We will also check anything within a day of each sources Timestamp.

One of the biggest (and at the moment, under-utilized/noticed in beagle) is the "Source" or "Owner"