Creative Commons Attribution 4.0 International License
© Rakhesh Sasidharan

Get a list of links in a web page (part 2)

A continuation of my previous post on getting links.

Internet Explorer is an application with a rich COM interface. And PowerShell can work with COM objects. Thus you can do the following:

The Document property is very powerful and lets you see a lot of details of the page. It has a subproperty Links that gives all the link elements in the page (it has nearly 800 properties, methods, and events!). The output is a collection of objects, and since we are only interested in the actual link targets we can select just the href property.
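
A minimal sketch of the COM approach described above, assuming a Windows machine with Internet Explorer installed. The function name Get-IELinks and the polling interval are my own choices for illustration, not something from the original post.

```powershell
# Hypothetical helper: returns the href of every link on a page via the IE COM object.
function Get-IELinks {
    param([Parameter(Mandatory)][string]$Url)

    # Start an (invisible) Internet Explorer instance through COM.
    $ie = New-Object -ComObject InternetExplorer.Application
    try {
        $ie.Navigate($Url)

        # Wait until the page has finished loading (ReadyState 4 means complete).
        while ($ie.Busy -or $ie.ReadyState -ne 4) {
            Start-Sleep -Milliseconds 200
        }

        # Document.links holds the link elements; keep only their href values.
        $ie.Document.links | Select-Object -ExpandProperty href
    }
    finally {
        # Always close the browser process we started.
        $ie.Quit()
    }
}
```

Calling it would look like `Get-IELinks -Url 'http://example.com/'`, where the URL is just a placeholder.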

If you are on PowerShell v3, things are even easier. There’s a cmdlet called Invoke-WebRequest that is your friend.

To get an object representing the website do:

To get all the links in that website:

And to just get a list of the href elements:
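
The three steps above could be sketched roughly as follows. The wrapper function Get-PageLinks is a hypothetical name used only to keep the steps together, and any URL passed to it is a placeholder.

```powershell
# Hypothetical helper that walks through the three steps with Invoke-WebRequest.
function Get-PageLinks {
    param([Parameter(Mandatory)][string]$Uri)

    # Step 1: an object representing the website.
    $site = Invoke-WebRequest -Uri $Uri

    # Step 2: all the links in that website (objects with href and other properties).
    $links = $site.Links

    # Step 3: just the href values.
    $links | Select-Object -ExpandProperty href
}
```

For example, `Get-PageLinks -Uri 'http://example.com/'` should list the link targets of that page, one per line.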

Like the System.Net.WebClient class, Invoke-WebRequest has parameters to specify proxy, headers, encoding, etc.

Get a list of links in a web page (part 1)

Using the System.Net.WebClient class and an old-fashioned regexp to pull out links:
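
A rough sketch of the regexp approach, not necessarily the pattern originally used. The function name Get-LinksFromHtml, the pattern, and the sample HTML are assumptions for illustration; a regexp like this only handles quoted href attributes and is no substitute for a real HTML parser.

```powershell
# Hypothetical helper: pull href values out of a string of HTML with a regexp.
function Get-LinksFromHtml {
    param([Parameter(Mandatory)][string]$Html)

    # Match <a ... href="..."> or <a ... href='...'> and capture the quoted value.
    $pattern = '<a\s[^>]*?href\s*=\s*["'']([^"'']+)["'']'
    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase

    [regex]::Matches($Html, $pattern, $options) |
        ForEach-Object { $_.Groups[1].Value }
}

# Sample input so the sketch runs without network access.
$sample = '<p><a href="http://example.com/one">One</a> and <A class="x" HREF=''/two''>Two</A></p>'
Get-LinksFromHtml -Html $sample

# With a real page, the HTML would come from the WebClient class instead:
# $html = (New-Object System.Net.WebClient).DownloadString('http://example.com/')
# Get-LinksFromHtml -Html $html
```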

The System.Net.WebClient class

The System.Net.WebClient class can be used to work with web pages.

To download pages, this class has a few methods:

  • DownloadData downloads the page and returns it as an array of bytes.
  • DownloadString downloads the page and returns it as one long string.
  • DownloadFile downloads the page and saves it to a file name you specify.
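
A small sketch of the three methods side by side. To keep it runnable without network access it points the client at a temporary local file through a file:// URI, which is purely an assumption for illustration; an http:// URL is passed in exactly the same way. The file names are made up.

```powershell
# Create a small local page so the sketch runs offline (illustrative only).
$html   = '<html><body><a href="http://example.com/">example</a></body></html>'
$source = Join-Path ([System.IO.Path]::GetTempPath()) 'webclient-demo-source.html'
[System.IO.File]::WriteAllText($source, $html)
$url = ([System.Uri]$source).AbsoluteUri   # file:// URI standing in for a web address

$client = New-Object System.Net.WebClient

# DownloadData: the page as an array of bytes.
$bytes = $client.DownloadData($url)

# DownloadString: the page as one long string.
$text = $client.DownloadString($url)

# DownloadFile: the page saved to a file name you specify (returns nothing).
$saved = Join-Path ([System.IO.Path]::GetTempPath()) 'webclient-demo-copy.html'
$client.DownloadFile($url, $saved)

"{0} bytes, {1} characters, saved copy exists: {2}" -f $bytes.Length, $text.Length, (Test-Path $saved)
```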

The class also has properties you can set to be used while downloading a page. For instance:

  • QueryString to specify pairs of query parameters and their values. For example, to do a Google search for the word “rakhesh” one can fetch the page http://www.google.com/search?q=rakhesh. Here q=rakhesh is the query string, with q being the parameter and rakhesh its value. To do the same via the System.Net.WebClient class one would do the following:

  • Headers to specify pairs of headers that can be set when requesting the web page:

  • Credentials to specify credentials for accessing the web page:

  • ResponseHeaders to view the headers received in response.
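
The properties in the list above could be set along these lines. This is only a sketch: the User-Agent string and the user name and password are placeholders, and the actual request is left commented out so nothing is sent over the network.

```powershell
$client = New-Object System.Net.WebClient

# QueryString: parameter/value pairs appended to the URL (here ?q=rakhesh).
$client.QueryString.Add('q', 'rakhesh')

# Headers: request headers sent along with the download (placeholder value).
$client.Headers.Add('User-Agent', 'Mozilla/5.0 (example)')

# Credentials: used if the server asks for authentication (placeholder account).
$client.Credentials = New-Object System.Net.NetworkCredential('user', 'password')

# Uncomment to actually send the request (needs network access):
# $page = $client.DownloadString('http://www.google.com/search')

# ResponseHeaders: only filled in once a request has completed.
# $client.ResponseHeaders
```

Because the query string lives on the client object, the URL passed to DownloadString stays free of the ?q=rakhesh part.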

There are other properties and methods too; the above are the ones I had a chance to look at today.