PowerShell power tips: Scripting downloads to parse web content

One of the really cool aspects of PowerShell that I rarely hear discussed is PowerShell’s ability to parse web content. The capabilities that exist are far too extensive to cover them all in a single article, but I wanted to take the opportunity to show you a few PowerShell power tips and tricks to discover links within a web page, and even how to use those links to download files.

Over the years, I have read several articles that explain how to use PowerShell to compile a list of the links that exist within a web page. Most of these articles make it seem as though the process is really complicated. In reality, however, you can find all the links by using a single line of code.

Suppose for a moment that I wanted to find all of the links that exist on the home page of my own website (www.BrienPosey.com). I could accomplish this task by entering the following command:


(Invoke-WebRequest -URI http://www.brienposey.com).Links


Don’t forget the parentheses

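Notice that the Invoke-WebRequest call has to be enclosed in parentheses for the command to work. The parentheses make PowerShell run the web request first, and the Links property is then read from the result. You can see the command and a partial output in the screenshot below.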


As you can see in the above screenshot, there is quite a bit of information displayed for each link. We can see everything from the link’s text to the destination URI (which is listed as href). But what if we wanted to condense the output a little bit and create a list of the link URLs, without all of the other information? Such an action is surprisingly easy to accomplish. Because the URLs are all contained in the href attribute, we can simply append .href to the end of the command shown above. Similarly, if we wanted to create a list of page names instead, we could append .InnerText to the command shown above. The commands would look like this:


(Invoke-WebRequest -URI http://www.brienposey.com).Links.href
(Invoke-WebRequest -URI http://www.brienposey.com).Links.InnerText


You can see how this works in the screenshot below.
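
The .href and .InnerText shorthand relies on a PowerShell feature (available since version 3.0) that reads a named property from every element of a collection. Here is a minimal offline sketch of that behavior. The two link objects are made-up sample data standing in for the Links collection, not real output from any site.

```powershell
# Hypothetical sample data standing in for (Invoke-WebRequest ...).Links
$links = @(
    [pscustomobject]@{ innerText = 'Home';  href = 'https://example.com/' }
    [pscustomobject]@{ innerText = 'About'; href = 'https://example.com/about' }
)

# The property is read from every element in the array
$hrefs = $links.href
$names = $links.innerText

# Display both lists
$hrefs
$names
```

Because each result is an ordinary array of strings, it can be piped to other cmdlets, such as Where-Object or Sort-Object, just like any other list.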

Making it easier to read

If we wanted to put the link information into a format that is easier to read, we could use a command like this:


(Invoke-WebRequest -URI http://www.brienposey.com).Links | Select-Object InnerText, href


You can see the output in the screenshot below.

So as you can see, it’s relatively easy to acquire link information from a web page by using PowerShell. The question, therefore, becomes how can we make use of this link data?

We can use link information to perform a scripted download of files. It is possible, for example, to write a script to download the latest application updates. Such a script could even perform version checking and automated installation, although I am not going to take things that far in this article. For right now, I want to focus on showing you how to perform an automated download.

If the link to the file that you want to download never changes, then you can simply hardcode the link. If the link is likely to change, though, you will have to jump through a few more hoops.

Unfortunately, there is nothing (from a PowerShell perspective) that differentiates a file download link from a link to a web page. That being the case, you will have to use text filtering to narrow down the list of links. Let me show you how this works.

Since I don’t have any downloads on my own website, I am going to use the Expert GPS site for the sake of demonstration. Expert GPS is an awesome utility for GPS navigation and is something that I use all the time. However, I am only using this site for demonstration purposes. The technique that I am about to show you will work for just about any site.

If I were to use a variation of the commands that I have already shown you, I would get a list of every link on the page. However, that doesn’t really help me if my goal is to perform an automated download. Instead, I need to figure out what makes the download link different from the other links on the page. One possibility is that the download link might use the word download somewhere in the descriptive text. That being the case, we could filter on the word download. The command might look something like this:


(Invoke-WebRequest -URI "https://www.expertgps.com/download.asp").Links | Where-Object {$_ -like '*download*'} | Select-Object InnerText, href


You can see the output below:


In this case, two download links are returned, but only one actually uses the word download as a part of the URL. That being the case, we could further filter the output by directing the filter specifically to the href portion of the object. Here is what it looks like:


(Invoke-WebRequest -URI "https://www.expertgps.com/download.asp").Links | Where-Object {$_.href -like '*download*'} | Select-Object InnerText, href


Now keep in mind that there is no rule that says that I have to filter on the word “download.” I could have filtered on any text. The word download was just a logical starting point.
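
To make the difference between filtering on the link text and filtering on the URL more concrete, here is a small offline sketch. The three link objects are hypothetical and exist only for illustration. They are not the actual links from the Expert GPS page.

```powershell
# Hypothetical link objects; the text and URLs are illustrative only
$links = @(
    [pscustomobject]@{ innerText = 'Download the trial'; href = 'https://example.com/files/setup.exe' }
    [pscustomobject]@{ innerText = 'Free download';      href = 'https://example.com/download/app.exe' }
    [pscustomobject]@{ innerText = 'Contact us';         href = 'https://example.com/contact' }
)

# Match the visible link text (-like is case-insensitive by default)
$byText = @($links | Where-Object { $_.innerText -like '*download*' })

# Match only the destination URL
$byHref = @($links | Where-Object { $_.href -like '*download*' })

"Matched on text: $($byText.Count)"
"Matched on href: $($byHref.Count)"
```

In this sample, two links mention a download in their text, but only one has the word in its URL, so the href filter is the narrower of the two. Wrapping each pipeline in @( ) keeps the result an array even when only one link matches, which makes the count reliable.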

So to perform the download, there are two things that we need to do. First, we need to assign the link to a variable. In this case, I am calling the variable $URI. This works because the filter returns a single link. If the filter matched more than one link, $URI would hold a list of links and the filter would need to be narrowed further. Second, we have to use the Invoke-WebRequest command to download the file from the specified URL. In the interest of simplicity, I am hard-coding the output filename, but there is a way to derive the filename from the site from which you are downloading the file. Here is what the commands look like:


$URI = (Invoke-WebRequest -URI "https://www.expertgps.com/download.asp").Links.href | Where-Object {$_ -like '*download*'}
Invoke-WebRequest -URI $URI -Outfile "C:\Data\ExpertGPS.exe" -PassThru


You can see these commands in action in the screenshot below.
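
One way to avoid hard-coding the filename is to take it from the link itself. What follows is only a sketch under a few assumptions: the page address, the href value, and the C:\Data folder are all hypothetical, and the [System.Uri]::new() syntax requires PowerShell 5.0 or later. It also covers the case where the page uses a relative link rather than a full URL.

```powershell
# Hypothetical values: the page address and an href as it might appear in the page
$pageUri = [System.Uri]'https://example.com/downloads/index.html'
$href    = '/files/SetupTool.exe?version=2'

# Resolve the href against the page address; an absolute href passes through unchanged
$fileUri = [System.Uri]::new($pageUri, $href)

# LocalPath leaves out the query string, so GetFileName returns the last path segment
$fileName = [System.IO.Path]::GetFileName($fileUri.LocalPath)

# Build the output path in a hypothetical destination folder
$outFile = [System.IO.Path]::Combine('C:\Data', $fileName)

$fileUri.AbsoluteUri
$outFile
```

The resulting $fileUri and $outFile values could then be supplied to the -URI and -Outfile parameters of Invoke-WebRequest. Keep in mind that some download links end in a script name rather than a filename, in which case this approach would not produce a useful name.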

PowerShell power tips: So much more you can do

This is actually only the tip of the iceberg with regard to what you can do using the Invoke-WebRequest cmdlet. As previously noted, you can extract filenames from the website and use those filenames when saving the file that you are downloading. It’s also possible to parse downloadable content.