Digital Library

Title:      HTML SEGMENTATION USING ENTROPY GUIDED TRANSFORMATION LEARNING
Author(s):      Evelin Carvalho Freire de Amorim
ISBN:      978-989-8533-09-8
Editors:      Bebo White and Pedro Isaías
Year:      2012
Edition:      Single
Keywords:      Information Retrieval, web page segmentation, Machine Learning
Type:      Full Paper
First Page:      11
Last Page:      18
Language:      English
Paper Abstract:      An HTML page can be represented by a data structure called the DOM tree, which is suitable for encoding and presenting information. However, a DOM tree is not suitable for search-engine processing. To process an HTML page through its DOM tree, the page is segmented into smaller, semantically coherent items. There are many rule-based approaches to HTML segmentation. In this paper, a machine learning strategy is applied to the task. The learning algorithm used is Entropy Guided Transformation Learning (ETL), a supervised algorithm for token classification problems. ETL has been applied to a wide range of problems, achieving state-of-the-art results in many of them, such as part-of-speech tagging, phrase chunking, and named entity recognition, and is competitive with the state of the art in clause identification. The ETL strategy for segmenting HTML consists of three main steps. The first step processes a set of DOM trees and extracts the input features, which consist of DOM tree structural information. In the ETL model, DOM tree nodes are represented as token units and HTML segments as token chunks. The second step generates the token windows required by the learning algorithm. Finally, the third step generates a model that can extract segments from different kinds of web pages. To evaluate the proposed model, experiments were set up using web pages from the following web portals: Ig, Tech Blogs, and CNN news. Due to the huge volume of data, a cloud computing infrastructure was used to compute the results. This infrastructure allowed us to simulate an environment similar to that of a large-scale search engine, which is relevant because HTML segmentation is commonly used to improve the results of large-scale search engines. Normalized Mutual Information (NMI) and the Adjusted Rand Index (AdjRAND) were adopted as evaluation metrics. Values of 70% for NMI and 90% for AdjRAND were observed. These empirical findings indicate that the proposed approach is promising, since the results were obtained using very simple and general web page features.
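The first two steps of the pipeline summarized above — linearizing DOM tree nodes into token units with structural features, then building the token windows the learning algorithm consumes — can be sketched as follows. The feature choice (tag name plus depth), the padding token, and the window size are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of ETL-style preprocessing for HTML segmentation:
# DOM nodes become token units, and sliding windows provide local context.
from html.parser import HTMLParser

class DomTokenizer(HTMLParser):
    """Linearize a DOM tree into token units of (tag, depth) features."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        self.tokens.append((tag, self.depth))  # one token per DOM node

    def handle_endtag(self, tag):
        self.depth -= 1

def token_windows(tokens, size=3):
    """Fixed-size context windows over the token sequence, padded at the ends."""
    pad = [("<pad>", 0)] * (size // 2)
    seq = pad + tokens + pad
    return [tuple(seq[i:i + size]) for i in range(len(tokens))]

parser = DomTokenizer()
parser.feed("<div><h1>News</h1><p>Body</p></div>")
windows = token_windows(parser.tokens)
# parser.tokens: [("div", 1), ("h1", 2), ("p", 2)]
```

Each window would then be paired with a chunk label (e.g. begin/inside/outside a segment) to form the training instances for the classifier.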
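One of the evaluation metrics mentioned above, the Adjusted Rand Index, compares a predicted segmentation against a reference one by treating both as labelings of the same items. A minimal pure-Python sketch (the paper's own tooling is not specified; this follows the standard pair-counting definition):

```python
# Adjusted Rand Index between two labelings of the same item sequence.
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))      # contingency cells
    a = Counter(labels_true)                            # reference cluster sizes
    b = Counter(labels_pred)                            # predicted cluster sizes
    sum_comb = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)               # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                           # degenerate labelings
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

The index is 1.0 for identical segmentations (up to label renaming) and near 0 for chance-level agreement, which is what makes the reported 90% AdjRAND a strong result.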