
We are proud to share that in October this year, our researchers Benedict Yeoh and Wang Hui Juan presented their paper GROWN+UP: A “Graph Representation Of a Webpage” Network Utilizing Pre-training at the 31st ACM International Conference on Information and Knowledge Management (CIKM 2022, Atlanta, USA).

Large pre-trained neural network backbones are ubiquitous in the domains of image processing (e.g. ResNet, ViT) and NLP (e.g. BERT, GPT). These models have shown state-of-the-art results on various tasks with many interesting applications. In stark contrast, little comparable work has been done for webpage information retrieval, even though webpages on the Internet numbered upwards of 4 billion in 2021.

We introduce GROWN+UP, a one-size-fits-all webpage parser to extract useful webpage features. GROWN+UP is short for “Graph Representation Of a Webpage” Network Utilizing Pre-training, where HTML webpages are ingested by a Graph Neural Network (GNN) based feature extractor that can be used for distinct downstream tasks (Fig. 1). GROWN+UP, with fine-tuning on small datasets, outperforms other methods on a number of task-specific benchmarks.

Fig. 1: GROWN+UP at a Glance

GROWN+UP holistically parses the text, media content, page layout, visual representation and the relationship between elements of a webpage. This is done by representing the DOM of a webpage as a graph where element features are embedded as node features (Fig. 2).

Fig. 2: Left: A DOM Element and its Features Embedded as Node Features. Right: An HTML DOM Represented as a Tree.
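As a rough illustration of the DOM-as-graph idea, the sketch below uses Python's standard `html.parser` module to turn an HTML string into a node list and an edge list. It is not the GROWN+UP implementation: the names `DomGraphBuilder` and `html_to_graph` are hypothetical, and only a few raw element features (tag, id, class, text, child count) are kept. The embedding step and the media, layout and visual features are omitted.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Collects DOM elements as nodes and parent-child links as edges."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # one feature dict per element
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        index = len(self.nodes)
        self.nodes.append({
            "tag": tag,
            "id": attrs.get("id") or "",
            "class": attrs.get("class") or "",
            "text": "",
            "num_children": 0,
        })
        if self._stack:
            parent = self._stack[-1]
            self.edges.append((parent, index))
            self.nodes[parent]["num_children"] += 1
        if tag not in VOID_TAGS:
            self._stack.append(index)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed tags.
        for depth in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[depth]]["tag"] == tag:
                del self._stack[depth:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            node = self.nodes[self._stack[-1]]
            node["text"] = (node["text"] + " " + text).strip()


def html_to_graph(html):
    """Return (nodes, edges) for an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


page = ('<html><body><div id="main" class="content">'
        '<p>Hello</p><img src="a.png"><p>World</p></div></body></html>')
nodes, edges = html_to_graph(page)
```

In this sketch each entry of `nodes` plays the role of a node feature vector before embedding, and `edges` is the information an adjacency tensor would encode.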

The element features as well as the adjacency tensors of the graph are then passed into a deep GNN-based feature extractor (Fig. 3). The first part of the feature extractor consists of a graph convolutional block, a linear block and finally a Long Short-Term Memory (LSTM) layer. This series of blocks is repeated to increase network depth as well as the receptive field of the graph convolution. Residual connections and the LSTM layers help to mitigate over-smoothing. A series of transformer blocks is also added to process graph-level features.

Fig. 3: GROWN+UP Feature Extractor architecture.
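To make the point about depth and receptive field concrete, here is a deliberately simplified sketch in plain Python. It is an assumption-laden toy, not the GROWN+UP architecture: the graph convolution is replaced by unweighted mean aggregation over neighbours, there are no learned weights, and the linear, LSTM and transformer blocks are left out. The function names `graph_conv` and `residual_block` are hypothetical.

```python
def graph_conv(features, edges):
    """One mean-aggregation step: each node averages itself and its neighbours."""
    n = len(features)
    neighbours = [[i] for i in range(n)]  # start with a self-loop per node
    for a, b in edges:                    # treat every edge as undirected
        neighbours[a].append(b)
        neighbours[b].append(a)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
         for d in range(dim)]
        for i in range(n)
    ]


def residual_block(features, edges):
    """Graph convolution followed by a residual (skip) connection."""
    conv = graph_conv(features, edges)
    return [[x + c for x, c in zip(row, conv_row)]
            for row, conv_row in zip(features, conv)]


# A chain of four nodes: 0 - 1 - 2 - 3, with a signal only on node 0.
chain_edges = [(0, 1), (1, 2), (2, 3)]
h0 = [[1.0], [0.0], [0.0], [0.0]]
h1 = residual_block(h0, chain_edges)  # signal reaches node 1
h2 = residual_block(h1, chain_edges)  # signal reaches node 2
h3 = residual_block(h2, chain_edges)  # signal reaches node 3
```

Each additional block lets information travel one hop further along the chain, which is the sense in which stacking blocks widens the receptive field. The skip connection keeps each node's own features in the sum, one common way to limit over-smoothing.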

What sets GROWN+UP apart from related works that target only text content is that it learns not only DOM element features but also useful features of the entire webpage. To this end, the model is pre-trained on the 180k webpages of the CommonCrawl 2008 archive, jointly and in a self-supervised manner, on the following tasks:

  1. Masked Node Prediction
  2. “Same-Website” Prediction

The first pre-training task is inspired by BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding). We mask input nodes and predict the following node features based on the surrounding graph context (Fig. 4):

  • Tag Type
  • Text Embedding
  • ID Embedding
  • Class Embedding
  • Number of Child Nodes

Fig. 4: Masking scheme for the Node Prediction task
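The masking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme from the paper: the 15% default `mask_ratio` is borrowed from BERT's convention rather than taken from GROWN+UP, the `"[MASK]"` placeholder and the function name `mask_nodes` are hypothetical, and raw field values stand in for the embeddings the model would actually predict.

```python
import random

# The five node features listed above, as keys of a per-node dict.
MASKED_FIELDS = ("tag", "text", "id", "class", "num_children")


def mask_nodes(nodes, mask_ratio=0.15, seed=0):
    """Hide the features of a random subset of nodes.

    Returns (masked_nodes, targets), where targets maps each masked node
    index to the original feature values the model should predict.
    """
    rng = random.Random(seed)
    count = min(len(nodes), max(1, round(len(nodes) * mask_ratio)))
    masked_indices = sorted(rng.sample(range(len(nodes)), count))
    masked = [dict(node) for node in nodes]  # copy, leave the input intact
    targets = {}
    for i in masked_indices:
        targets[i] = {field: nodes[i][field] for field in MASKED_FIELDS}
        for field in MASKED_FIELDS:
            masked[i][field] = "[MASK]"
    return masked, targets


# Ten toy nodes; with mask_ratio=0.2 exactly two of them are masked.
sample_nodes = [
    {"tag": "p", "text": f"paragraph {i}", "id": "", "class": "body",
     "num_children": 0}
    for i in range(10)
]
masked_nodes, targets = mask_nodes(sample_nodes, mask_ratio=0.2, seed=1)
```

The graph edges are left untouched, so a model trained on `masked_nodes` has to reconstruct `targets` from the surrounding graph context alone.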

The second task, “Same-Website” Prediction, is to discern “structurally similar” webpages, an example of which can be seen in Fig. 5a. Manually labelling webpages would be prohibitive in terms of time and resources. Instead, a proxy was used: webpages from a website were grouped by their URLs, and URLs from the same domain that share the longest sub-paths were considered “similar”, as shown in Fig. 5b. Webpages that fulfil these requirements tend to be visually similar. Through pre-training on this task, GROWN+UP is able to generate useful graph-level features for downstream tasks.

Fig. 5a: An example of “Structurally Similar” webpages from different websites

Fig. 5b: An example of the Proxy: two webpages that come from the same domain and share the longest sub-path
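One plausible reading of the URL proxy is sketched below using `urllib.parse` from the Python standard library. The exact grouping rule is not spelled out here, so the details are assumptions: sub-paths are compared segment by segment from the start of the path, ties are all returned, and the helper names and the `example.com` URLs are hypothetical.

```python
from urllib.parse import urlparse


def path_segments(url):
    """Non-empty path segments of a URL, e.g. /news/2022/a.html -> 3 items."""
    return [segment for segment in urlparse(url).path.split("/") if segment]


def shared_prefix_length(url_a, url_b):
    """Number of leading path segments two URLs share; -1 if domains differ."""
    if urlparse(url_a).netloc.lower() != urlparse(url_b).netloc.lower():
        return -1
    count = 0
    for a, b in zip(path_segments(url_a), path_segments(url_b)):
        if a != b:
            break
        count += 1
    return count


def most_similar(anchor, candidates):
    """Candidates on the anchor's domain sharing the longest sub-path with it."""
    scored = [(shared_prefix_length(anchor, c), c)
              for c in candidates if c != anchor]
    scored = [(score, c) for score, c in scored if score >= 0]
    if not scored:
        return []
    best = max(score for score, _ in scored)
    return [c for score, c in scored if score == best]


anchor = "https://example.com/news/2022/a.html"
candidates = [
    "https://example.com/news/2022/b.html",  # shares /news/2022
    "https://example.com/news/2021/c.html",  # shares /news
    "https://example.com/about",             # shares only the domain
    "https://other.org/news/2022/a.html",    # different domain
]
positives = most_similar(anchor, candidates)
```

Pairs selected this way could serve as positive examples for the “Same-Website” task without manual labelling. In this sketch, a candidate that shares only the domain still counts when nothing better exists, and a stricter rule could require at least one shared segment.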

As proof that our general-purpose webpage parser can solve a variety of downstream tasks, we benchmark on the following unrelated tasks:

  1. Boilerplate removal: Extraction of main textual content from a webpage
  2. Genre classification: Classification of a webpage into a single genre

These represent a node-level task and a graph-level task respectively. In each of these tasks, we benchmark GROWN+UP on two datasets against previous works that are either rule-based or engineered specifically for the task. The benchmarks are run multiple times and standardized to improve the robustness of the results. In both benchmarking tasks, GROWN+UP outperforms previous work (Table 1), lending credence to the claim that GROWN+UP is able to serve as a general and effective webpage parser.

Table 1: GROWN+UP Benchmark (Boilerplate Removal and Genre Classification)

Unstructured forms of data such as webpages represent a treasure trove of data that is hard to extract via traditional means. GROWN+UP represents the first reported one-size-fits-all parser for websites. The research establishes a framework for parsing websites and also demonstrates the efficacy of the method on two unrelated tasks.