
Scripting

  • 26-04-2013 11:47am
    #1
    Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭


    Not too sure where to put this.
    I spent an educational couple of hours last evening with a copy of 'python for dummies' and wrote some python to scrape the random photo threads and produce a list of top thanked posts.

    Usage is ...
    Usage: -t thread-number [-t thread-number] [-s start-date(dd-mm-yyyy) [-e end-date(dd-mm-yyyy)]] [-n number]
    
    Multiple threads can be specified by specifying more than one -t option followed by the thread id.
    If no start date is specified the entire thread or threads is searched for posts to compare.
    If a start date is specified but no end date the end date is set to 7 days after the start date. The time ranges are INCLUSIVE
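
    A minimal sketch of the date handling described above, written for Python 3 rather than the 2.6 the script targets. The names `resolve_window` and `in_window` are made up for illustration and are not functions from the script. It assumes a missing end date defaults to start + 7 days and that both ends of the range are inclusive.

```python
from datetime import date, datetime, timedelta

def resolve_window(start_text=None, end_text=None):
    """Turn optional dd-mm-yyyy strings into a (start, end) pair of dates."""
    fmt = "%d-%m-%Y"
    start = datetime.strptime(start_text, fmt).date() if start_text else None
    end = datetime.strptime(end_text, fmt).date() if end_text else None
    # A start date with no end date means "seven days after the start".
    if start is not None and end is None:
        end = start + timedelta(days=7)
    return start, end

def in_window(post_date, start, end):
    """No start date means the whole thread; otherwise both ends are inclusive."""
    if start is None:
        return True
    return start <= post_date <= end

start, end = resolve_window("26-01-2013")
print(start, end)  # 2013-01-26 2013-02-02
```

    So `-s 26-01-2013` on its own covers the 26th of January up to and including the 2nd of February.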
    

    The thread id is the one that appears in the URL of the form http://www.boards.ie/vbulletin/showthread.php?t=2056814041 where the 't' parameter is the thread id.
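
    If you'd rather paste the whole URL than pick the id out by hand, something like this would do it. It's a sketch in Python 3 using only the standard library, and `thread_id_from_url` is a hypothetical helper that is not part of the script.

```python
from urllib.parse import urlparse, parse_qs

def thread_id_from_url(url):
    """Return the value of the 't' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("t")
    return values[0] if values else None

print(thread_id_from_url("http://www.boards.ie/vbulletin/showthread.php?t=2056814041"))
# 2056814041
```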

    Output is to stdout so you can pipe it or select, and is simple BBCode with the name of the poster, thanks, and the first image from the post, wrapped in a link to the post itself (which I think people were always keen on)

    Here's the code for anyone interested. I'm running Python 2.6.5, and the script has external dependencies on the latest version of BeautifulSoup (imported as bs4) for the scrapiness, and on the lxml parser. Other than that it should be fairly straightforward. Disclaimer: scrapers are pretty sensitive to changes in output and formatting. Any changes to the boards HTML will probably break this...

    -edit- hmm. The [code] tags in vbulletin don't display the embed properly. If you 'quote' this post and select the code you'll get it all. -edit-
    
    [code]
    import httplib
    import re
    import sys, getopt
    import operator
    from bs4 import BeautifulSoup 
    from datetime import date
    from datetime import datetime
    from datetime import timedelta
    
    class post:
        id = 0
        postdate = date.today()
        url = ""
        count = 0
        username = ""    
        img = ""
        thanks = 0
    
    def templatePosts(posts,num):
        #print BBCode for the top 'num' posts only; posts must already be sorted by thanks
        for post in posts[:num]:
            print "[CENTER][B]"+post.username+" ("+str(post.thanks)+")[/B]\n[URL=http://www.boards.ie/vbulletin/"+post.url+"][IMG]"+post.img+"[/IMG][/URL][/CENTER]\n\n"
            
        
    def parseSinglePage(soupyPage, start,end, posts, skipFirst = False):
    
        idRegex = re.compile(r"post([0-9]+)")
        thanksRegex = re.compile(r"\(([0-9]+)\) thanks from:")
        dateTimeRegex = re.compile(r"(Yesterday|Today|[0-9,-]+),(.)")
        all_tables = soupyPage.find_all("table",id=idRegex)
    
        for (i,table) in enumerate(all_tables):
    
            if i == 0 and skipFirst:
                continue
    
            aPost = post()
    
            m = idRegex.match(table.get('id'))
            aPost.id = m.group(1)
    
            headers = table.find_all("td","thead")
            #should be two td elements with class thead. The first one will contain the date, the second one will have the post count & url
            dateString = str(headers[0].contents[2].string).strip(' \n\r\t')
            dateMatch = dateTimeRegex.match(dateString)
            #if the date can't be parsed the post keeps its default postdate (today)
            if dateMatch != None:
                dateString = dateMatch.group(1)
                if dateString == "Today":
                    aPost.postdate = date.today()
                elif dateString == "Yesterday":
                    oneday = timedelta(days=1)
                    aPost.postdate = date.today() - oneday
                else:
                    aPost.postdate = datetime.strptime(dateString,"%d-%m-%Y").date()
                    
            aPost.url  = str(headers[1].find("a").get("href"))
            aPost.count = str(headers[1].find("strong").string)
            
            #search into the table element and find the div w/ id = postmenu_<postId> then the 'a' tag underneath that 
            #which will give us the username of the poster.
            username = table.find("div",id="postmenu_"+aPost.id).find("a").string
            aPost.username = username
    
            #now search into the table element again to find the div with the same id and class postcontent
            #then search into IT to get the img tag if it's there and grab the src from it
            #this will only find the first image in the post. Meh.
            postcontent = table.find("div",id="post_message_"+aPost.id,class_="postcontent")
            imgTag = postcontent.find("img")
            if imgTag != None:
                aPost.img = str(imgTag.get("src"))
            else:
                aPost.img = None
            
            
            #print "Post. id: "+aPost.id+" Count: "+aPost.count+" User: "+aPost.username
            #use the ID above to lazily get the DIV that has the thanks box in it, then css select into that and get the strong element to
            #get the actual text of the thanks string, then extract the thanks count from the string with the regex
            thanksCount = 0    
            thanksString = soupyPage.find("div",id="post_thanks_box_"+aPost.id).select(".alt2 > strong")
            if thanksString:
                thanksMatch = thanksRegex.match(thanksString[0].string)
                if thanksMatch != None:
                    thanksCount = thanksMatch.group(1)
                else:
                    thanksCount = "1"
    
            aPost.thanks = int(thanksCount)
    
            #we only want to add the post if EITHER the start isn't set (we use the entire thread)
            #or the start < post date and the end > post date 
            #AND the thanks > 0
            #AND there's an image actually in the post.
            if (start is None or (start <= aPost.postdate and end >= aPost.postdate)) and aPost.thanks > 0 and aPost.img != None:
                posts.append(aPost)
    
    def handleOneThread(conn, threadId, start, end, posts):
        
        conn.request("GET","/vbulletin/showthread.php?t="+threadId)
        response = conn.getresponse()
        print >> sys.stderr,"Thread id: "+threadId+" "+str(response.status) +" "+str(response.reason)
        if response.status != 200 :
            print >> sys.stderr,"Unable to load thread from id "+threadId+""
            sys.exit(1)
    
        allPage = response.read()
        soup = BeautifulSoup(allPage,"lxml")
    
        pageRegex = re.compile("Page 1 of ([0-9]+)")
        pages = soup.find("table",class_="pagination_wrapper")
        pagesMatch = 1
        if pages:
            pagesString = str(pages.find("td", class_ = "vbmenu_control").string).strip(' \n\r\t')
            pagesMatch = int(pageRegex.match(pagesString).group(1))
    
        print >> sys.stderr,"Page 1 of "+str(pagesMatch)
        parseSinglePage(soup, start, end, posts, True)    
        for pageNumber in range(2,pagesMatch+1):
            conn.request("GET","/vbulletin/showthread.php?t="+threadId+"&page="+str(pageNumber))
            response = conn.getresponse()
            allPage = response.read()
            print >> sys.stderr,"Retrieved page "+str(pageNumber)+" of "+str(pagesMatch) +" " + str(response.status) +" "+str(response.reason)
            soup = BeautifulSoup(allPage,"lxml")
            parseSinglePage(soup, start, end, posts)
    
    def main(argv):
        #get the initial connection, GET the first page and parse for page numbers.
        
        try:
            opts, args = getopt.getopt(argv,"t:s:e:n:",["thread=","start=","end=","num="])
        except getopt.GetoptError:
            usage()
            sys.exit(2)
    
        start = None 
        end = None
        num = 10
        threads = []
        
        for opt, arg in opts:
            if opt in ("-t","--thread"):
                threads.append(arg)
            elif opt in("-s","--start"):
                try:
                    start = datetime.strptime(arg,"%d-%m-%Y").date()
                except ValueError:
                    print >> sys.stderr, "Start date '"+arg+"' must be of the form dd-mm-yyyy, e.g. 15-04-2013"
                    sys.exit(2)
            elif opt in("-e","--end"):
                try:
                    end = datetime.strptime(arg,"%d-%m-%Y").date()
                except ValueError:
                    print >> sys.stderr, "End date '"+arg+"' must be of the form dd-mm-yyyy, e.g. 15-04-2013"
                    sys.exit(2)
            elif opt in("-n","--num"):
                num = int(arg)
            
        if start != None and end is None:
            week = timedelta(days=7)
            end = start + week
    
        if len(threads) == 0:
            usage()
            sys.exit(2)
    
        if start != None:    
            print >> sys.stderr, "Timerange for posts is from "+str(start)+" to "+str(end)
    
    
        conn = httplib.HTTPConnection('www.boards.ie',80)
        conn.connect()
    
        posts = []
        
        for threadId in threads:
            handleOneThread(conn, threadId, start, end, posts)
    
        #how many retrieved posts in total        
        print >> sys.stderr, "Posts: "+str(len(posts) )
        
        #sort the posts by thanks
        posts = sorted(posts, key=operator.attrgetter('thanks'), reverse=True)   
    
        #log just the top N posts to stderr
        for post in posts[:num]:
            print >> sys.stderr, "Post. id: "+str(post.id)+" Count: "+str(post.count)+" User: "+str(post.username)+" Thanks: "+str(post.thanks)+" Date: "+str(post.postdate)
    
        #now output the posts to stdout in some nice BBCODE format
    
        templatePosts(posts,num)
    
    
    def usage():
        print >> sys.stderr, "Usage: -t thread-number [-t thread-number] [-s start-date(dd-mm-yyyy) [-e end-date(dd-mm-yyyy)]] [-n number]"
        print >> sys.stderr, "\nMultiple threads can be specified by specifying more than one -t option followed by the thread id."
        print >> sys.stderr, "If no start date is specified the entire thread or threads is searched for posts to compare."
        print >> sys.stderr, "If a start date is specified but no end date the end date is set to 7 days after the start date. The time ranges are INCLUSIVE"
    
    if __name__ == "__main__":
        main(sys.argv[1:])
    [/code]
    
    


Comments

  • Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭DaireQuinlan


    Output from the following command:

    python thankless.py -t 2056814041 -s 26-01-2013

    is as follows. This is the top 10 thanked posts from the Random Photos XLI thread, from the 26th Jan to the 02nd Feb inclusive, so basically the basis of the next thread that DGK needs to create :D

    They were all contained in the one thread this time, but multiple threads can be specified if necessary and the post list will be built up from a combination of them.
    maddog (40)
    71D46126C17346E9942C2E5186711FB9-0000345914-0003145775-00800L-D556B140E1EF43D8828433C792BC3B85.jpg


    Doom (38)
    https://us.v-cdn.net/6034073/uploads/attachments/330238/238110.JPG


    Reati (36)
    6F22DF8E96EA4E3EA7AF6569339B7F35-0000341326-0003152595-00800L-AF70330844D44D0A9B8CC87AD04CA735.jpg


    Mjollnir (30)
    8430329172_93590bd181_c.jpg


    mikka631 (27)
    Goat_270113_3.jpg


    .Longshanks. (27)
    8425540161_2495538d15_z.jpg


    redape99 (27)
    8439570944_3ece9f13b9_c.jpg


    gerk86 (26)
    5CE7014860EE4B1FB1241F3EC2D9E4CC-0000333639-0003143930-00800L-AC70A6350AA24239965B9644399C1041.jpg


    dirtyghettokid (25)
    8437744799_5350cb6bf9_z.jpg


    Art Deko (24)
    8430302837_d967fd2e27_c.jpg


    Art Deko (23)
    8405900307_c9590cb4a0_c.jpg


    Art Deko (23)
    8422602769_81a6947cda_c.jpg


  • Registered Users, Registered Users 2 Posts: 9,060 ✭✭✭Kenny Logins


    Clever. :D


  • Closed Accounts Posts: 18,966 ✭✭✭✭syklops


    That's an impressive bit of work for someone who only started learning Python yesterday. Surely you have some previous programming experience?


  • Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭DaireQuinlan


    syklops wrote: »
    That's an impressive bit of work for someone who only started learning Python yesterday. Surely you have some previous programming experience?

    (OP may have contained some facetiousness)


  • Closed Accounts Posts: 18,966 ✭✭✭✭syklops


    (OP may have contained some facetiousness)

    I'm glad to hear that. I had a real feeling of inadequacy there for a few minutes.


  • Registered Users, Registered Users 2 Posts: 2,988 ✭✭✭dirtyghettokid


    that's cool! the only issue is that some users would win POTW and HM, which if they win POTW i tend to not give HM, and give it to next in line. not sure if this is "fair" in others eyes, but it's just the way i've been doing it. i'll have another look later.. up to my eyes at the min. but great idea!! good work :cool:


  • Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭DaireQuinlan


    that's cool! the only issue is that some users would win POTW and HM, which if they win POTW i tend to not give HM, and give it to next in line. not sure if this is "fair" in others eyes, but it's just the way i've been doing it. i'll have another look later.. up to my eyes at the min. but great idea!! good work :cool:

    Yeah I figured. You can output any number of posts with the [-n num] command line option (it defaults to 10), so you can weed out POTW winners, and I assume duplicates, manually. Or feel free to add it to the code :)
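
    For anyone who does want to add it, here's one rough way the weeding-out could look. It's a Python 3 sketch, not part of the script: `top_posts_one_per_user` is a made-up name, posts are simplified to (username, thanks) pairs, and treating maddog as an excluded POTW winner is purely for illustration.

```python
def top_posts_one_per_user(posts, num=10, exclude=()):
    """posts: iterable of (username, thanks) pairs; returns up to num pairs."""
    excluded = set(exclude)
    seen = set()
    picked = []
    # Highest thanks first; sorted() is stable, so ties keep their input order.
    for username, thanks in sorted(posts, key=lambda p: p[1], reverse=True):
        if len(picked) >= num:
            break
        # Skip excluded users and anyone who already has a post in the list.
        if username in excluded or username in seen:
            continue
        seen.add(username)
        picked.append((username, thanks))
    return picked

sample = [("maddog", 40), ("Doom", 38), ("Art Deko", 24), ("Art Deko", 23)]
print(top_posts_one_per_user(sample, num=3, exclude=["maddog"]))
# [('Doom', 38), ('Art Deko', 24)]
```

    Each user keeps only their most-thanked post, so the list still needs a human eye for anything the thanks count alone can't decide.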


  • Moderators, Arts Moderators Posts: 10,520 Mod ✭✭✭✭5uspect


    I've only recently started learning Python myself, this is seriously impressive.

