Digital Library

Title:      IMPROVING THE SCALABILITY OF A GENRE-AWARE APPROACH TO FOCUSED CRAWLING
Author(s):      Guilherme Tavares de Assis and Marcos Vinicius Oliveira Souza
ISBN:      978-989-8533-82-1
Editors:      Pedro Isaías and Hans Weghorn
Year:      2018
Edition:      Single
Keywords:      Focused Crawling, Distributed Crawling, Content and Genre Terms
Type:      Full Paper
First Page:      159
Last Page:      166
Language:      English
Paper Abstract:      Focused crawlers aim to find and crawl Web pages that are relevant to a specific topic of interest to the user, which makes them important for a wide variety of applications. In this context, a focused crawling approach was previously proposed and developed in which the topic of interest is expressed by terms that describe both the content and the genre of the desired Web pages. As demonstrated experimentally, this enables the construction of focused crawlers that perform effective and efficient crawling processes. To improve the scalability of this genre-aware approach to focused crawling, this work proposes a new architecture in which the steps of a focused crawling process can be carried out in a distributed manner. Experiments show improved scalability relative to the original, non-distributed form of the approach: with 8 computers, the total execution time of focused crawling processes for two distinct topics of interest was reduced by 83.5% on average.
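As a rough reading aid, the reported time saving can be converted into an implied speedup and a per-machine efficiency. The sketch below is not from the paper: the function names are hypothetical, and it assumes the 83.5% average saving can be treated as a single representative measurement on 8 computers, which is only an approximation because the figure is an average over two topics.

```python
def speedup_from_savings(savings: float) -> float:
    """Speedup implied by a fractional reduction in execution time.

    If the distributed run takes (1 - savings) of the original time,
    the speedup is original_time / distributed_time = 1 / (1 - savings).
    """
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in the range [0, 1)")
    return 1.0 / (1.0 - savings)


def parallel_efficiency(speedup: float, machines: int) -> float:
    """Fraction of ideal linear speedup achieved with the given machine count."""
    if machines <= 0:
        raise ValueError("machines must be a positive integer")
    return speedup / machines


# Figures taken from the abstract; treating the average as one data point
# is an assumption made for illustration only.
savings = 0.835
machines = 8

speedup = speedup_from_savings(savings)
efficiency = parallel_efficiency(speedup, machines)

print(f"implied speedup: about {speedup:.2f}x")
print(f"implied efficiency: about {efficiency:.2f}")
```

Under these assumptions the saving corresponds to a speedup of roughly 6x on 8 computers, or about three quarters of ideal linear scaling.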
   
