Lesson 6: Filtering a Web Page

We will now integrate what we learnt in the last two lessons into a Rails controller that filters out everything from a Wikipedia page except the main text content.

We start as usual with

script/generate controller wikifilter index

then edit app/views/wikifilter/index.html.erb to contain the single line:

<%= @display %>

Our job in the controller is to put something into @display. To do this, we edit app/controllers/wikifilter_controller.rb as follows:


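require 'net/http'
require 'rubygems'
require 'hpricot'

class WikifilterController < ApplicationController
  def index
    # The part of the URL after /wikifilter/index/, e.g. 'Ski'
    w = params[:id]
    begin
      # Fetch the raw HTML of the Wikipedia article
      page = Net::HTTP.get('en.wikipedia.org', '/wiki/' + w)
      doc = Hpricot(page)
      # Keep only the paragraphs inside the main content element
      bc = doc.search('#bodyContent')
      ds = bc/:p
      @display = ds.to_html
    rescue
      # Any failure (no network, missing id) ends up here
      @display = 'Is the Internet on?'
    end
  end
end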

Two lines here are new. The first assigns params[:id] to w, which puts into w the part of the URL that follows http://localhost:3000/wikifilter/index/. So if we request the URL http://localhost:3000/wikifilter/index/Ski, w will contain 'Ski'.
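
To make the mapping from URL to params[:id] concrete, here is a minimal plain-Ruby sketch. It only imitates the default controller/action/id route; the helper name split_default_route is made up for illustration and is not part of Rails, whose real router does considerably more.

```ruby
# Hypothetical helper, not part of Rails: imitates the default
# ':controller/:action/:id' route by splitting a URL path into segments.
def split_default_route(path)
  controller, action, id = path.split('/').reject(&:empty?)
  { :controller => controller, :action => action, :id => id }
end

params = split_default_route('/wikifilter/index/Ski')
w = params[:id]
puts w   # prints Ski
```

If the URL stops at /wikifilter/index, the :id entry is nil. In the controller above, that case would raise inside the begin block and end up in the rescue branch.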

The other new line assigns ds.to_html to @display, which puts into @display not the plain text, as we had in lesson 5, but the actual HTML of the paragraphs, with everything we do not need already sliced away.
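
The difference between keeping the HTML and keeping only the text can be sketched without Hpricot. The example below is an illustration only: it uses regular expressions as a crude stand-in for Hpricot's parsing, and the sample markup is invented. Regular expressions are not a reliable way to parse real pages, so treat this as a picture of the two results rather than as a replacement for the controller code.

```ruby
# A tiny stand-in for a fetched page (made-up sample markup).
page = '<div id="bodyContent"><p>A <b>ski</b> is a narrow strip.</p>' \
       '<table><tr><td>infobox</td></tr></table>' \
       '<p>Skis are used in <a href="/wiki/Skiing">skiing</a>.</p></div>'

# Collect each <p>...</p> element, markup included; the table is skipped.
paragraphs = page.scan(/<p\b[^>]*>.*?<\/p>/m)

# Rough analogue of ds.to_html: the paragraphs with their tags preserved.
as_html = paragraphs.join

# Rough analogue of a text-only result: the same paragraphs, tags stripped.
as_text = paragraphs.map { |para| para.gsub(/<[^>]+>/, '') }.join(' ')

puts as_html
puts as_text
```

With the HTML version, bold text and links survive when the view renders @display. With the text version, only the words remain.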