Tuesday, November 5

Web Content Scraping With Jsoup

Introduction: jsoup is a Java library that parses HTML from a URL, a file, or a string. It can manipulate HTML elements, attributes, and text, and it extracts data using DOM traversal or CSS selectors.

The jsoup jar can be downloaded from http://jsoup.org/download.
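Before fetching a live page, it may help to see parsing from a String in isolation. The following is a minimal sketch, assuming the jsoup jar is on the classpath; the class name ParseFromString, the helper methods, and the sample markup are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseFromString {

    // Hypothetical sample markup, standing in for a downloaded page.
    static final String HTML =
        "<html><head><title>Sample page</title></head>"
        + "<body><p id=\"intro\">Hello, <b>jsoup</b>!</p></body></html>";

    // Parses the markup and returns the contents of its <title> tag.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Parses the markup and returns the combined text of all elements
    // matched by the CSS query.
    static String textOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println("title:" + titleOf(HTML));
        System.out.println("intro:" + textOf(HTML, "p#intro"));
    }
}
```

Parsing from a file works along the same lines through Jsoup.parse(File in, String charsetName), which can throw an IOException.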

The following example connects to an NBC news website (http://www.today.com) and extracts the headline news and stories.



import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class webScraping {

    public static void main(String[] args) {
        try {
            // Fetch the page and parse it into a DOM tree
            Document doc = Jsoup.connect("http://www.today.com/").get();
            String title = doc.title();
            System.out.println("title:" + title);
            System.out.println("\n");

            System.out.println("Top Headline News:");
            System.out.println("\n");

            // Headline news: links inside <li class="interactive">
            Elements lis = doc.select("li[class=interactive] a");
            for (Element li : lis) {
                System.out.println(li.text());
                System.out.println("Link:" + li.attr("href"));
                System.out.println('\n');
            }

            // Stories: links inside <div class="story">
            Elements divs = doc.select("div[class=story] a");
            for (Element div : divs) {
                System.out.println("Story:" + div.text());
                System.out.println("Link:" + div.attr("href"));
                System.out.println('\n');
            }
        } catch (IOException ex) {
            // Network or HTTP failure while fetching the page
            ex.printStackTrace();
        }
    }
}
           

The connect(String url) method creates a new Connection, and get() fetches the page and parses it into a Document. The select method queries the document using jsoup's CSS-like selector syntax and returns the matching elements. In this example, the query "div[class=story] a" finds every “a” link inside a div tag whose class attribute is ‘story’. The attr method retrieves a named attribute from a given element; here it reads the href attribute of each “a” tag.
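The behavior of select and attr can be tried offline on a small inline document. The sketch below assumes the jsoup jar is on the classpath; the class name SelectAndAttr, the links helper, and the markup are hypothetical and only loosely modeled on the page structure queried above. It also contrasts the attribute selector div[class=story], which matches only when the whole class attribute equals "story", with the class selector div.story, which matches any div that carries the story class.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectAndAttr {

    // Hypothetical markup: the second div carries an extra class.
    static final String HTML =
        "<div class=\"story\"><a href=\"/news/first\">First story</a></div>"
        + "<div class=\"story featured\"><a href=\"/news/second\">Second story</a></div>";

    // Returns "text -> href" for every element matched by cssQuery.
    static List<String> links(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        List<String> out = new ArrayList<>();
        for (Element a : doc.select(cssQuery)) {
            out.add(a.text() + " -> " + a.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Exact attribute match: skips the div whose class is "story featured"
        System.out.println(links("div[class=story] a"));
        // Class selector: matches both divs
        System.out.println(links("div.story a"));
    }
}
```

If a page combines several classes on one element, the class selector form is likely the more robust choice.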



Output:
title:TODAY - Latest News, Video & Guests from the TODAY show on NBC

 Top Headline News:

 WNBC: Democrat de Blasio projected to win NYC mayor race
Link:http://www.nbcnewyork.com/news/local/NYC-Mayor-Race-Democrat-Bill-de-Blasio-Joe-Lhota-Republican-230579591.html?asdfasdfadfad

 Obama heads to Texas to sell health care law
Link:http://www.nbcnews.com/health/obama-heads-high-stakes-texas-sell-health-care-law-8C11538530

 Pope Francis polls Catholics on divorce, same-sex marriage
Link:http://worldnews.nbcnews.com/_news/2013/11/05/21320327-pope-francis-latest-surprise-a-survey-on-the-modern-family?lite

 Wrong page? Zombies invade Fox News website
Link:http://www.nbcnews.com/technology/wrong-page-fox-news-website-appears-be-running-test-content-8C11535730

 Relax! No real proof bacon will hurt sperm
Link:http://www.nbcnews.com/health/no-real-proof-bacon-can-hurt-sperm-so-let-your-8C11535732

 I wanted to ‘protect’: LAX officer recounts attack
Link:http://usnews.nbcnews.com/_news/2013/11/05/21312566-i-came-to-the-tsa-to-protect-people-injured-officer-recounts-lax-attack?lite

 Retailers want to make ‘showrooming’ a no-show
Link:http://www.nbcnews.com/business/retailers-want-make-showrooming-no-show-8C11535653


Story:Skydivers will jump again, Michelle Knight reflects on captivity
Link:http://www.today.com/news/todays-takeaway-skydivers-will-jump-again-michelle-knight-reflects-captivity-8C11535663

 Story:NBC News: Christie to win N.J. governor's race
Link:http://nbcpolitics.nbcnews.com/_news/2013/11/05/21322371-christie-poised-for-big-re-election-win-in-nj-with-2016-on-the-horizon?lite

 Story:Runner Joy Johnson, 86, dies one day after New York City Marathon
Link:http://www.today.com/news/famed-runner-joy-johnson-86-dies-one-day-after-nyc-8C11535662

 Story:Photos capture close bond between conservationists and lioness
Link:http://www.today.com/pets/photographs-capture-incredible-bond-between-conservationists-lioness-8C11535735

