Introduction: jsoup is a Java library that parses HTML from a URL, a file, or a
string. It can manipulate HTML elements, attributes, and text, and it extracts
data using DOM traversal or CSS selectors.
The jsoup jar can be downloaded from http://jsoup.org/download.
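As a small illustration of parsing from a string, the sketch below builds a Document from an in-memory HTML snippet and reads the same paragraph once with a DOM-style lookup and once with a CSS selector. The HTML snippet, the class name ParseFromString, and the helper method names are made up for this illustration; only the jsoup calls (Jsoup.parse, getElementById, select, text, title) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseFromString {
    // Hypothetical HTML, used only for illustration.
    static final String HTML =
        "<html><head><title>Sample page</title></head>"
        + "<body><p id=\"intro\">Hello, <b>jsoup</b>!</p></body></html>";

    // DOM-style lookup: find the element by its id, then read its text.
    static String introViaDom(Document doc) {
        Element intro = doc.getElementById("intro");
        return intro.text();
    }

    // CSS-selector lookup: same element, found with a selector query.
    static String introViaCss(Document doc) {
        return doc.select("p#intro").first().text();
    }

    public static void main(String[] args) {
        // Parse from a String; no network access is needed.
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.title());
        System.out.println(introViaDom(doc));
        System.out.println(introViaCss(doc));
    }
}
```

Both lookups return the text of the same paragraph, so which style to use is mostly a matter of taste; selectors tend to be shorter once the query involves nesting or attributes.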
The following example connects to the NBC TODAY website (http://www.today.com)
and extracts the headline news and stories.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class webScraping {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.today.com/").get();
            String title = doc.title();
            System.out.println("title:" + title);
            System.out.println("\n");
            System.out.println("Top Headline News:");
            System.out.println("\n");

            //Headline News
            Elements lis = doc.select("li[class=interactive] a");
            for (Element li : lis) {
                System.out.println(li.text());
                System.out.println("Link:" + li.attr("href"));
                System.out.println('\n');
            }

            //Stories
            Elements divs = doc.select("div[class=story] a");
            for (Element div : divs) {
                System.out.println("Story:" + div.text());
                System.out.println("Link:" + div.attr("href"));
                System.out.println('\n');
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
The connect(String url) method creates a new Connection, and get() fetches the
HTML page and parses it into a Document. The select method queries the document
using jsoup's CSS-like selector syntax. In this example we look for "a" tags
inside li elements whose class is 'interactive' and inside div elements whose
class is 'story'. The attr method retrieves a specific attribute of a given
element; in this example we retrieve the href attribute of each "a" (link) tag.
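To make the behaviour of select and attr easier to see without a network call, here is a minimal sketch that runs the same kind of query against an in-memory snippet. The HTML, the base URL http://example.com/, and the names SelectSketch and links are assumptions made up for this illustration. It also shows one detail worth knowing: the attribute selector div[class=story] matches only when the class attribute is exactly "story", while the class selector div.story also matches elements that carry additional classes.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectSketch {
    // Hypothetical markup, loosely modelled on the "story" blocks above.
    static final String HTML =
        "<div class=\"story\"><a href=\"/news/one\">First story</a></div>"
        + "<div class=\"story featured\"><a href=\"/news/two\">Second story</a></div>"
        + "<div class=\"ad\"><a href=\"/promo\">Promo</a></div>";

    // Runs a selector and returns "text -> raw href -> absolute href" per link.
    static List<String> links(String selector) {
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(HTML, "http://example.com/");
        List<String> out = new ArrayList<>();
        for (Element a : doc.select(selector)) {
            out.add(a.text() + " -> " + a.attr("href") + " -> " + a.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Exact attribute match: only the div whose class is exactly "story".
        System.out.println(links("div[class=story] a"));
        // Class selector: any div that has "story" among its classes.
        System.out.println(links("div.story a"));
    }
}
```

If a page uses relative links, absUrl("href") is one way to get a full URL, provided the Document was created with a base URI (Jsoup.connect sets it from the requested URL).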
Output:
title:TODAY - Latest News, Video & Guests from the TODAY show on NBC
Top Headline News:
WNBC: Democrat de Blasio projected to win NYC mayor race
Link:http://www.nbcnewyork.com/news/local/NYC-Mayor-Race-Democrat-Bill-de-Blasio-Joe-Lhota-Republican-230579591.html?asdfasdfadfad
Obama heads to Texas to sell health care law
Link:http://www.nbcnews.com/health/obama-heads-high-stakes-texas-sell-health-care-law-8C11538530
Pope Francis polls Catholics on divorce, same-sex marriage
Link:http://worldnews.nbcnews.com/_news/2013/11/05/21320327-pope-francis-latest-surprise-a-survey-on-the-modern-family?lite
Wrong page? Zombies invade Fox News website
Link:http://www.nbcnews.com/technology/wrong-page-fox-news-website-appears-be-running-test-content-8C11535730
Relax! No real proof bacon will hurt sperm
Link:http://www.nbcnews.com/health/no-real-proof-bacon-can-hurt-sperm-so-let-your-8C11535732
I wanted to ‘protect’: LAX officer recounts attack
Link:http://usnews.nbcnews.com/_news/2013/11/05/21312566-i-came-to-the-tsa-to-protect-people-injured-officer-recounts-lax-attack?lite
Retailers want to make ‘showrooming’ a no-show
Link:http://www.nbcnews.com/business/retailers-want-make-showrooming-no-show-8C11535653
Story:Skydivers will jump again, Michelle Knight reflects on captivity
Link:http://www.today.com/news/todays-takeaway-skydivers-will-jump-again-michelle-knight-reflects-captivity-8C11535663
Story:NBC News: Christie to win N.J. governor's race
Link:http://nbcpolitics.nbcnews.com/_news/2013/11/05/21322371-christie-poised-for-big-re-election-win-in-nj-with-2016-on-the-horizon?lite
Story:Runner Joy Johnson, 86, dies one day after New York City Marathon
Link:http://www.today.com/news/famed-runner-joy-johnson-86-dies-one-day-after-nyc-8C11535662
Story:Photos capture close bond between conservationists and lioness
Link:http://www.today.com/pets/photographs-capture-incredible-bond-between-conservationists-lioness-8C11535735