ClickLogs: Goals and Research questions

How relevant is a document for a given search term? This is the central question when we try to answer a user's search query. In a decentralized setting like when searching torrents, we mainly have to work with metadata like titles: The actual content is not available at search-time, so it cannot be analyzed to better predict relevance. However, user feedback is a precious resource in estimating relevance. Particularly implicit feedback, i.e. feedback which the user provides simply by using a system, can be a great benefit because it can be collected cheaply and at a large scale. This package aims to improve search quality in Triber by collecting and processing a particular kind of implicit feedback: click log data.

Click logs contain information about which document was selected after searching for which keyword, and at which position in the result list the selected document was positioned. This data can be used to improve future search result lists; the easiest scenario would be a case in which we assume that for a given keyword, it's always the same torrent that's selected although it's displayed at rank #16. A simple way to improve search results would simply be to always put this torrent on the top of the list for the particular keyword. Of course, such situations will turn out more complicated in reality, and how to optimally address them is one of the questions of this research project.

The goals of this project are three-fold:

  1. Improve the search experience for Tribler users by reranking search results using clicklog data.
  2. Explore and compare reranking algorithms using the obtained data.
  3. Gain experience with the dissemination and processing of user-generated metadata; lay foundations for a distributed tagging system.

1. Improvement of Tribler Search

Find what you're looking for, and have a pleasant experience while you're at it - this should be the offer of Tribler's search engine, particularly towards average users which do not want to browse many pages of results or become particularly smart about search terms. The aim of the clicklog approach is to support both search experience and quality:

1.1. Integrating reranking mechanisms into Tribler Search

The main step is to rerank search results based on latent information in the clicklog data, as explained above.

1.2. "Suggest" Feature using database of frequent keywords

Another use of clicklog data is a search field offering "auto-complete" or rather "suggest" functionality; while the user it typing, the program offers possible extensions to the already typed letters.

2. Exploring Reranking Algorithms

We are aiming to examine the influence of different reranking algorithms / of using reranking at all. Therefore, it is planned to randomly pick a reranking algorithm at search time and save it with the clicklog data. If we can accumulate enough data, we should be able to make a statement about the use of reranking by regarding, in the simplest case, the average position of clicked results; if clicked results are significantly higher for reranked results than for un-reranked results, reranking should be considered successful.

3. User-Generated Meta-Data

3.1 Spam Issues

Tribler's server-free infrastructure creates a different situation than for most other clicklog applications; normally, clicklog analysis is performed by the owner of a centralized search engine; with Tribler, the only clicks really observed are those of the local user; other clicks are only received via the BuddyCast protocol. Since these transmissions might be faked, it has to be examined in how far faked clicklogs could be used to, e.g., promote spam, and what counter-measures are appropriate.

  • A very simple approach for boosting a torrent's search position might be to transmit information about how it was selected for just about any keyword the spammer would like it to be associated with. However, the TfIdf weighting scheme introduced in the previous section counters such an attack because torrent/keyword cooccurrences are linearly weighted by the number of keywords associated with a torrent.

How do these issues scale, and which additional counter-measures are required when we are facing explicitly given tags in the future?

3.2 Privacy Issues

In the current phase, the goal is to overcome the technical difficulties in distributing and using clicklog data, i.e., a technological one. However, on a conceptual level, we have to ask ourselves under which situations clicklog data might be harmful in terms of privacy, and which countermeasures could be taken. Such questions will be explicitly addressed in a later step of the project.

Contact: Nicolas Neubauer, Neural Information Processing Group, TU Berlin [‚Äčneubauer@cs.tu-berlin.de]

Implementation Notes

How to enable query logging

As of rev. 10204, you may enable logging of search keywords by adding sth like

RemoteQueryMsgHandler.getInstance().setLogFile("c:\crawl.txt")

to startAPI in Tribler/Main/tribler.py

Clicklog wire protocol

As of overlay version 8, the preference part of buddycast messages looks like this:

[["hash1", ["linux","ubuntu"], 1, 2], ["hash2", ...], ...]

when it would've looked like this:

["hash1", "hash2", ...]

earlier on.

This means that instead of a list of hashes, a list of lists is returned. Each inner list contains the hash, keywords used to find the corresponding torrent, the click position (where 1 is in fact the second position in the list) and the number of the reranking strategy used (here: 2).

Important methods in the clicklog data lifecycle

The data for this message is first stored in Core/CacheDB/SqliteCacheDBHandler:MyPreferenceDBHandler.addClicklogToMyPreference

The message itself is created in Core/CacheDB/SqliteCacheDBHandler:MyPreferenceDBHandler.getRecentLivePrefListWithClicklog

On the passive part, it is digested (as well as pre-OL 8 buddycast data) and converted into a more comfortable dict for further internal processing in Core/CacheDB/buddycast.py:BuddyCastCore.createPreferenceDictionaryList

The resulting data is finally stored in Core/CacheDB/SqliteCacheDBHandler:PreferenceDBHandler.addPreferences

Overall Todos

  1. Create database structure to store relevant data (search terms used for and position of own and peers' downloads, from here on "clicklog data") (done)
  1. Extend interface to store own clicklog data (done)
  1. Extend BuddyCast Protocol (done)
    • Active: Attach clicklog to each transmitted preference hash
    • Passive: Receive and store clicklog data
  1. Test performance (done)
    • simulated data
  1. Use clicklog data (done)
    • AutoComplete in search dialog
    • Rerank search results
  1. Track distributed clicklog data (waiting for deployment as of April 2009)

1. Database

  • A note on indices: Maybe I'm a bit too generous with those, but I think we need them unless we want to keep the tables in memory...
  • Both own and others' search terms are stored in Search. I use a peer_id of 0 for identifying the actual user's entries.
  • Torrents are removed from MyPreference when the torrent itself is removed,- this means the corresponding search terms and position info should also be removed? Or might it make sense to keep this info anyway, somewhere else?

2. Store clicks

  • On starting a download, store now additionally click_position and the currently used reranking strategy into MyPreference.
  • Also get the current query terms and store them into Search and Terms
  • By the way: By storing clicklogs in Preferences, we're throwing away any clicklog data leading to YouTube, as these dont show up there, right? But if I got it right, this is analogous to BuddyCast which works on MyPreference, only taking into account torrent downloads, right?

3. Extend BuddyCast Protocol

Create and parse an additional data structure passed on with each BuddyCast transmission. This structure contains the clicklog data for the n most current torrent downloads, where n is equal to the number of torrents transmitted by BuddyCast in the first place.

The constants

MAX_KEYWORDS_STORED = 5

MAX_KEYWORD_LENGTH = 50

define the maximum number of terms per torrent / chars per terms that are accepted by a client in order to prevent spamming.

4. Stress testing

After a number of improvements, time requirements are down to an acceptable size, I believe. See attachments for measurements performed with a dataset of search queries collected earlier on by the Tribler team.

5. Using the data

AutoComplete

  • performs a simple "like 'x%'" query and proposes the most popular completion of the entered characters by adding them to the text field and marking them

Reranking

Right now, 50% of the time (choosing by hours ;) no reranking takes place, and 50% of the time a simple heuristic is applied: The top and the second entry returned for a search query may be swapped if:

  • both torrents have at least MIN_SEEN_BEFORE_RERANK [5] clicklog records
  • the first torrent is less than MAX_POPULAR_RATIO [5] times more popular then the second one
  • the second torrent has a higher average position score. Each click gets a "position score" of 1-1/(1+click position)

The first two criteria are supposed to prevent reranking from being applied without a proper data basis, whereas the third criterion is the actual reranking criterion: If a torrent is consistently selected in a bad position, we might want to move it up.

Earlier considerations on reranking

Different reranking algorithms inlude

  1. No reranking
  2. Rerank by number of cooccurrences
  3. Rerank by TfIdf

Implement heuristics for improving rankings. Very generally, rank documents higher if we know peers found them using the current search terms.

  • By absolute times found using the current search term by peers, or
  • By times found using current search terms by peers, normalized by number of peers who have this torrent at all to avoid "Harry Potter phenomena" and discount spam attacks claiming their file to be relevant for any search term
  • How would position of previous clicks come into play here? Would it?
  • Could we use the information about the user search terms here as well?

In order to be able to make any statement about the effect and quality of reranking algorithms, we should swap them randomly (even if we only use one actual reranking strategy, we should swap it with strategy 0, the current, non-existing one).

We could do more elaborate learning if, for a click on position 3, we would also store the non-clicked torrents. I guess this is out of the question traffic-wise..?

Code-wise, this means writing alternative implementations of Core.Search.KeywordSearch and pass them on to Core.Search.SearchManager. Preferably, SearchManager would receive a couple of instances to choose from, each with a "getRerankingStrategyID()" or so, choose one at random at every search (while evaluating) and make the choice readable for later logging.

6. Tracking the data

  • Current state: Waiting for 5.0 to be deployed
  • There's a lot to be said about
    • which keywords were used for which torrent,
    • which keywords were used most frequently (this brings to my attention that the family filter would probably have to check the autocomplete...)
  • Of course, we would like to see if the reranking improves position. This is where the reranking_strategy comes into play. There should be some significant differences in average click position between reranked and non-reranked results (or between the different reranking strategies). Besides the possibility for fancy graphs, this allows for an informed decision which reranking strategy to finally use in a later version of Tribler.
  • Actually, any analysis we might perform on aggregated clicklog data could also be a funky analysis in the regular Tribler client, performed on the peer data available to the user. Just as a side node.