
Web Form/Database problem

  • 27-04-2005 5:31pm
    #1
    Closed Accounts Posts: 3,322 ✭✭✭Repli


    Basically what I am trying to ask is: is there a way of viewing a page's source if you know the exact page name, without going to the web page and right-clicking 'view source'? Like doing it through cmd in DOS or something. I have a few million web pages I need the source of, so I was thinking of writing some sort of script to automate it.

    Thanks!


Comments

  • Registered Users, Registered Users 2 Posts: 1,268 ✭✭✭hostyle


    Perl and LWP


  • Closed Accounts Posts: 17,208 ✭✭✭✭aidan_walsh


    Just use your script to send an HTTP GET request for the page and read everything after the first blank line. Everything above that is the HTTP header and irrelevant to your needs.
    GET url HTTP/1.0
    
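    As an illustration of that approach, a minimal PHP sketch using fsockopen to send the raw request by hand and split the response at that first blank line (the host and path are just placeholders):

    <?php
    // open a plain TCP connection to the web server (port 80)
    $fp = fsockopen("www.boards.ie", 80, $errno, $errstr, 30);
    if (!$fp) {
      die("connect failed: $errstr ($errno)\n");
    }
    // send a raw HTTP GET request
    fwrite($fp, "GET / HTTP/1.0\r\nHost: www.boards.ie\r\n\r\n");
    // read the full response
    $response = '';
    while (!feof($fp)) {
      $response .= fgets($fp, 4096);
    }
    fclose($fp);
    // the headers end at the first blank line (CRLF CRLF);
    // everything after it is the page source
    list($headers, $body) = explode("\r\n\r\n", $response, 2);
    echo $body;
    ?>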


  • Closed Accounts Posts: 4,655 ✭✭✭Ph3n0m


    php is your friend

    <?php
    // fetch the page source over HTTP using PHP's URL fopen wrapper
    $filename = 'test.txt';
    $handle = fopen("http://www.boards.ie/", "rb");
    $contents = '';
    while (!feof($handle)) {
      $contents .= fread($handle, 8192); // read up to 8 KB per call
    }
    fclose($handle);

    // append the fetched source to the output file
    $writer = fopen($filename, 'a');
    fwrite($writer, $contents);
    fclose($writer);

    echo "done";
    ?>
    
    

    just get the previous code into a loop, feeding it URLs and creating dynamic text files, and thus you have your own page source storer :)
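    For illustration, a rough sketch of that loop, assuming a hypothetical urls.txt file with one URL per line:

    <?php
    $urls = file('urls.txt');   // one URL per line
    foreach ($urls as $i => $url) {
      $handle = fopen(trim($url), "rb");
      if (!$handle) {
        continue;               // skip URLs that fail to open
      }
      $contents = '';
      while (!feof($handle)) {
        $contents .= fread($handle, 8192);
      }
      fclose($handle);

      // one numbered output file per page
      $writer = fopen("page$i.txt", 'w');
      fwrite($writer, $contents);
      fclose($writer);
    }
    echo "done";
    ?>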


  • Closed Accounts Posts: 3,322 ✭✭✭Repli


    Thanks to everyone for the suggestions =D
    At the moment I did something quick with wget (along the lines of the sketch below); it's working alright, but I'll probably use something like PHP because I will be adding to this.

    Ph3n0m wrote:
    $contents .= fread($handle, 8192);

    Ph3n0m, just a quick question: is that 8192 = 8MB? So the most you can read is 8MB of source at a time? Thanks
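    For reference, a minimal sketch of that quick wget route, assuming a urls.txt file (a hypothetical name) with one URL per line:

    wget -i urls.txt

    wget fetches every URL in the list and saves each page to a local file named after it.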


  • Closed Accounts Posts: 4,655 ✭✭✭Ph3n0m


    http://ie.php.net/manual/en/function.fwrite.php

    If the length argument is given (in this case 8192), writing will stop after length bytes have been written or the end of string is reached, whichever comes first.

    The same idea applies to fread(), which is the call the 8192 is passed to: it is 8192 bytes (8 KB, not 8 MB), and it only caps each individual read. The while loop keeps calling fread() until the end of the file, so there is no limit on the total amount of source you can pull down.


  • Closed Accounts Posts: 756 ✭✭✭Zaph0d


    How would you do this if you wanted to store enough info to render the page offline, e.g. images, frames, etc.?
    Would you need code to find the source of all URLs referenced in the HTML retrieved by the original HTTP GET?
    Would this have to be recursive?


  • Moderators, Science, Health & Environment Moderators Posts: 9,035 Mod ✭✭✭✭mewso


    I know it's a complete waste of time, but VB.Net is also your friend:
    ' create the request and read the whole response body into a string
    Dim myReq As System.Net.HttpWebRequest = CType(System.Net.WebRequest.Create("http://boards.ie/"), System.Net.HttpWebRequest)
    Dim myStream As System.IO.Stream = myReq.GetResponse.GetResponseStream
    Dim objReader As New System.IO.StreamReader(myStream)
    Dim PageContent As String = objReader.ReadToEnd
    objReader.Close() ' closes the underlying response stream too
    


  • Registered Users, Registered Users 2 Posts: 1,268 ✭✭✭hostyle


    Zaph0d wrote:
    How would you do this if you wanted to store enough info to render the page offline, e.g. images, frames, etc.?
    Would you need code to find the source of all URLs referenced in the HTML retrieved by the original HTTP GET?
    Would this have to be recursive?

    google for web spider / slurper / offline downloader software
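    One concrete option is wget, which Repli already used above for the single-page case; it can recurse and fetch page requisites itself. A sketch (the URL and depth are placeholders): -r recurses into linked pages, -l 1 limits the depth, -p pulls in images and other files needed to render the page, and -k rewrites links for offline viewing.

    wget -r -l 1 -p -k http://www.boards.ie/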


  • Registered Users, Registered Users 2 Posts: 4,003 ✭✭✭rsynnott


    aidan_walsh wrote:
    Just use your script to send an HTTP GET request for the page and read everything after the first blank line. Everything above that is the HTTP header and irrelevant to your needs.
    GET url HTTP/1.0

    Careful: if the site(s) use(s) virtual hosting, you should use HTTP/1.1 and send a Host header; otherwise it may not work.
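    To spell that out, the Host header is what lets a server hosting several sites on one IP address pick the right one (HTTP/1.1 makes it mandatory). The raw request would look something like:
    GET / HTTP/1.1
    Host: www.boards.ie
    Connection: close

    The Connection: close line asks the server to close the connection after the response, so a simple read-until-EOF loop like the ones above still works.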

