This thesis explores the possibilities of creating a robust web scraping algorithm and proposes an algorithm to address the problem. The algorithm is intended for websites with a repetitive HTML structure containing data that can be scraped. Such a structure typically displays items such as news articles, videos, or books, which results in HTML code that is repeated many times, since the only things that differ between the displayed items are, for example, their titles. A good example is YouTube. The scraper works by applying text classification to the words in the HTML code, training a Support Vector Machine to recognize the words or variable names. Recognizing the words that surround the sought-after data allows for potentially robust scraping in the future: small changes may be made to the code, but the words are likely to remain similar. To evaluate the algorithm, a web archive is used to back-test it on past versions of a site, with the aim of indicating what its future performance might look like. The results vary depending on a large number of variables within the websites themselves as well as in their past versions. The best performance was achieved on Yahoo News, with an accuracy of 90% dating back three months from the time the scraper stopped working.
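To make the idea of classifying the words that surround the sought-after data more concrete, the following minimal Python sketch shows one way such words could be collected from a repetitive HTML structure. It is an illustration under stated assumptions, not the implementation used in the thesis: the names `ContextTokenExtractor` and `extract_samples`, the tokenization rule, and the sample markup are all hypothetical. Each text node is paired with the tag names, attribute names, and attribute-value words of its enclosing elements; token lists of this kind could then be vectorized and passed to a text classifier such as a Support Vector Machine.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag; they cannot enclose text.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ContextTokenExtractor(HTMLParser):
    """Pair each text node with the words found in its enclosing tags.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, tokens) for every currently open tag
        self.samples = []  # (text, context_tokens) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        tokens = [tag]
        for name, value in attrs:
            tokens.append(name)
            # Split values such as "video-title" into separate words.
            tokens.extend(re.findall(r"[a-z0-9]+", (value or "").lower()))
        self.stack.append((tag, tokens))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            context = [tok for _, tokens in self.stack for tok in tokens]
            self.samples.append((text, context))


def extract_samples(html):
    """Return a list of (text, surrounding_words) pairs for an HTML string."""
    parser = ContextTokenExtractor()
    parser.feed(html)
    parser.close()
    return parser.samples


# Assumed sample markup with a repeated item structure.
page = """
<div class="video-item"><a id="video-title" href="/v/1">First clip</a></div>
<div class="video-item"><a id="video-title" href="/v/2">Second clip</a></div>
"""

for text, context in extract_samples(page):
    print(text, context)
```

In this sketch both titles share the surrounding words `video` and `title`, which is the kind of signal that may survive small changes to the markup, while the words taken from the `href` values differ between items and would carry little weight.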