Shell command to extract HTML source code content given a URL

guylikeme · 14-03-2016 12:58PM #1

Trying curl and wget with little joy

Graham · 14-03-2016 01:21PM

What have you tried and what are you getting back?

guylikeme · 14-03-2016 02:07PM

curl -u <user>:<password> <URL>

This gives the output of the page but not the HTML source, e.g. I can't find the text that's on the page.

14-03-2016 02:38PM

guylikeme wrote: »

curl -u <user>:<password> <URL>

This gives the output of the page but not the HTML source, e.g. I can't find the text that's on the page.

What do you mean by "output of page"?

curl www.boards.ie

prints

<!DOCTYPE html>
<html>
        <head>
                <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
                <meta name="viewport" content="width=device-width,initial-scale=1" />
                <title>boards.ie - Now Ye're Talkin'</title>
[...]

i.e. the HTML source.

guylikeme · 14-03-2016 03:04PM

What I want is to get the text of the page.

So in the boards example, I would get the thread titles.

deconduo · 14-03-2016 03:07PM

guylikeme wrote: »

What I want is to get the text of the page.

So in the boards example, I would get the thread titles.

Can you give us an example of what you are getting, and what you think you should be getting?

ED E · 14-03-2016 03:16PM

I think the OP has the terminology wrong: HTML source != HTML text content.

What you want is an HTML parser/scraper like Beautiful Soup, so you can find a tag and get its content.

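from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # html: the page source as a string
threadName = soup.title.string  # text inside the <title> tag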

The syntax is something like that; it's really, really friendly. If you work in Python it's great, and if not, your language probably has an equivalent.
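For anyone who would rather not install anything, here is a minimal sketch of the same idea using only Python's standard library `html.parser` module instead of Beautiful Soup. The names `TextExtractor` and `html_to_text` and the sample markup are made up for illustration; it simply collects the text between tags and skips anything inside `<script>` or `<style>`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags, ignoring <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is not inside a skipped element.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page, standing in for whatever curl returned.
sample = (
    "<html><head><title>boards.ie</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Some <b>thread</b> title</p></body></html>"
)
print(html_to_text(sample))
```

This only sees text that is present in the HTML it is given; anything the browser adds later via JavaScript will not show up.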

guylikeme · 14-03-2016 03:17PM

Ok, the output is best explained by doing the following...

1. Open Internet Explorer to a page that contains text
2. Save file (html)
3. Open the html in notepad

The text of the page is in that file. This is what I want to obtain.

14-03-2016 03:19PM

guylikeme wrote: »

What I want is to get the text of the page.

So in the boards example, I would get the thread titles.

What you're looking for is called "scraping". Doing it in the shell is not going to be easy, and the shell is a bad place to start. I would suggest using something like Python with the Beautiful Soup library.

It's a complex area. Take the example above of getting boards thread titles from the home page: the thread titles are loaded by JavaScript after the page loads. If you get the homepage source using curl, it won't have the titles, because curl doesn't run JavaScript.

Buford T Justice · 14-03-2016 11:50PM

As has been said above, curl will give you the HTML content. You are looking to parse it and extract certain elements.

I doubt the shell will make that very easy for you at all. An alternative to Python is jsoup in Java.

daymobrew · 15-03-2016 09:21AM

Perl and the LWP::Simple module will download the HTML file, and then you can parse it easily.

OfflerCrocGod · 16-03-2016 09:10AM

Cheerio for Node provides a jQuery-like API, if you are more used to that: https://www.npmjs.com/package/cheerio Example usage is here: https://github.com/briandipalma/random-scripts/blob/master/request_videos.js#L41

Graham · 16-03-2016 09:25AM

OP, you might get some more useful/specific suggestions if you explain to us what problem you're trying to solve.

Is this a one-off task or will you want to scrape content regularly?
If it's scheduled, will you be scraping the same sites/pages all the time?
What do you intend to do with the content once you have retrieved it (put it in a database, etc.)?
Are you a developer looking to put together your own solution or are you just looking for the quickest/easiest way to grab the content?
If you're a dev, what languages are you familiar with?

guylikeme · 21-03-2016 04:34PM

Graham wrote: »

OP, you might get some more useful/specific suggestions if you explain to us what problem you're trying to solve.

Is this a one-off task or will you want to scrape content regularly?
If it's scheduled, will you be scraping the same sites/pages all the time?
What do you intend to do with the content once you have retrieved it (put it in a database, etc.)?
Are you a developer looking to put together your own solution or are you just looking for the quickest/easiest way to grab the content?
If you're a dev, what languages are you familiar with?

One off.

Put it in a String to simply check whether it contains a certain String.

Java/Python/Bash
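For a one-off check like that, a rough Python sketch using only the standard library could look like the following. The function names `fetch_html` and `page_contains` are invented for this example, and the check is a plain case-insensitive substring match on the raw source, so it will only find text that is actually in the HTML the server sends back (not text added later by JavaScript).

```python
import sys
import urllib.request


def fetch_html(url, timeout=10):
    """Download the raw HTML source of url and return it as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_contains(html, needle):
    """Case-insensitive check for needle anywhere in the HTML string."""
    return needle.lower() in html.lower()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_page.py <URL> "<text to look for>"
    source = fetch_html(sys.argv[1])
    print("found" if page_contains(source, sys.argv[2]) else "not found")
```

If this reports "not found" for text that is clearly visible in the browser, that is a hint the text is being loaded after the initial page request.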

h57xiucj2z946q · 21-03-2016 05:25PM

Can you post the url to the actual page?

I'm guessing the page is using AJAX calls or maybe even the dreaded iframes, hence you are not finding the text you are looking for in the raw HTML source for the given URL.

guylikeme · 22-03-2016 09:42AM

daymobrew wrote: »

Perl and the LWP::Simple module will download the HTML file, and then you can parse it easily.

Can you show an example?

daymobrew · 22-03-2016 10:10AM

guylikeme wrote: »

Can you show an example?

#!/usr/bin/perl -w

use strict;
use LWP::Simple qw(get);

if ($ARGV[0]) {
  my $html = get $ARGV[0];
  
  # Contents of file now in $html.
}
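Since the OP listed Python and was passing a user and password to curl with -u, here is a rough Python counterpart to the script above. It is only a sketch: it assumes the site uses HTTP Basic authentication (which is what curl -u sends by default), and the function names `basic_auth_header` and `get_with_auth` are made up for this example.

```python
import base64
import urllib.request


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def get_with_auth(url, user, password, timeout=10):
    """Download url with Basic auth and return the HTML as a string."""
    request = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(user, password)}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The string returned by `get_with_auth` plays the same role as `$html` in the Perl version. If the site uses a login form or cookies rather than Basic auth, this will not be enough.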
