Query difficulty classification using search session length

Samuel Hertzberg

Abstract
For data to be successfully utilised, it must be indexed and readily available, and for data to be useful, it must be searchable. To make search solutions better, it is often in the search provider's interest to have some insight into their users' behaviour. One such insight can be produced by query difficulty prediction. Query difficulty is a measure of how well a query is expected to perform in a search setting. For example, a very ambiguous query might have a high query difficulty and perform poorly. Because such predictions can be of great value, several methods exist for determining query difficulty. In this thesis, a new way of predicting query difficulty, based on machine learning and search session lengths, is implemented and tested using a data set from Scania, a Swedish manufacturer of vehicles, primarily trucks and buses. Using search logs from Scania's intranet, search sessions were grouped and two machine learning models, a support vector classifier (SVC) and a stochastic gradient descent classifier (SGDC), were trained to predict the query difficulty as one of two classes. If a query is session-terminating, it is regarded as a query of lower difficulty, since the user is assumed to have found what they were looking for. If a query is session-continuing, it is regarded as having greater difficulty, since the user is assumed not to have found what they were looking for. To test this novel definition, it was placed in a machine learning context and used in two classifiers; the two models were compared to a simpler Naïve Bayes classifier on four different variations of the data. After the experiments, it became apparent that the choice of model and preprocessing techniques played a large part in the final accuracy of the models. On the main data variant, preprocessed without noise reduction, the SVC produced the best balanced accuracy, 58%.
It was also found that the SVC performed better than the SGDC and the baseline on all data variations when balanced accuracy was used as the metric. However, measured by positive predictive value and negative predictive value, all models produced almost equal results. The accuracy of the models was lower than expected; this is theorised to be mainly due to the noisy nature of the data set. More research is needed to evaluate search session-based query difficulty prediction on different data sets.
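
The session-based labelling described above, together with the balanced accuracy metric, can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the log format of (user, timestamp in seconds, query) tuples, the 30-minute inactivity threshold used to group sessions, and the names `label_queries` and `balanced_accuracy` are all assumptions made for illustration only.

```python
from collections import defaultdict

# Assumption: a session is a run of queries by one user in which consecutive
# queries are at most SESSION_GAP_SECONDS apart. The abstract does not state
# how sessions were actually grouped.
SESSION_GAP_SECONDS = 30 * 60

EASY = 0  # session-terminating query (lower difficulty)
HARD = 1  # session-continuing query (greater difficulty)


def label_queries(log):
    """Label each (user, timestamp, query) log entry as EASY or HARD."""
    by_user = defaultdict(list)
    for user, timestamp, query in log:
        by_user[user].append((timestamp, query))

    labelled = []
    for user in sorted(by_user):
        entries = sorted(by_user[user])  # chronological order per user
        for i, (timestamp, query) in enumerate(entries):
            is_last = i + 1 == len(entries)
            # The session ends here if no further query follows within the gap.
            ends_session = (
                is_last or entries[i + 1][0] - timestamp > SESSION_GAP_SECONDS
            )
            labelled.append((query, EASY if ends_session else HARD))
    return labelled


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on HARD and the recall on EASY.

    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    hard_hits = sum(t == HARD and p == HARD for t, p in pairs)
    easy_hits = sum(t == EASY and p == EASY for t, p in pairs)
    n_hard = sum(t == HARD for t in y_true)
    n_easy = sum(t == EASY for t in y_true)
    return 0.5 * (hard_hits / n_hard + easy_hits / n_easy)


# Hypothetical log entries, invented for this example.
example_log = [
    ("u1", 0, "truck brake manual"),
    ("u1", 40, "brake manual R-series"),
    ("u1", 5000, "canteen menu"),
    ("u2", 10, "holiday policy"),
]
labels = label_queries(example_log)
```

In this example only the first query is labelled as session-continuing, because the same user issues another query 40 seconds later; the remaining three queries each end a session. Balanced accuracy averages the per-class recalls, which is why it suits data where one class is much more common than the other.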