Scripting

26-04-2013 11:47am

DaireQuinlan

Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭

Join Date: November 2006

Posts: 6467

Not too sure where to put this.
I spent an educational couple of hours last evening with a copy of 'python for dummies' and wrote some python to scrape the random photo threads and produce a list of top thanked posts.

Usage is ...

Usage: -t thread-number [-t thread-number] [-s start-date(dd-mm-yyyy) [-e end-date(dd-mm-yyyy)]] [-n number]

Multiple threads can be specified by specifying more than one -t option followed by the thread id.
If no start date is specified the entire thread or threads is searched for posts to compare.
If a start date is specified but no end date the end date is set to 7 days after the start date. The time ranges are INCLUSIVE
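The option handling described above can be sketched roughly as follows. This is only an illustration, assuming Python 3 and argparse rather than the getopt-based Python 2 script posted below; the helper names parse_date, parse_args and in_range are made up for the example.

```python
import argparse
from datetime import datetime, timedelta


def parse_date(text):
    """Parse a dd-mm-yyyy string; argparse turns a ValueError into a usage error."""
    return datetime.strptime(text, "%d-%m-%Y").date()


def parse_args(argv):
    parser = argparse.ArgumentParser(description="List the top thanked posts")
    # action="append" collects every -t into a list, so -t can be repeated
    parser.add_argument("-t", "--thread", action="append", required=True)
    parser.add_argument("-s", "--start", type=parse_date, default=None)
    parser.add_argument("-e", "--end", type=parse_date, default=None)
    parser.add_argument("-n", "--num", type=int, default=10)
    args = parser.parse_args(argv)
    # A start date without an end date means a window ending 7 days later
    if args.start is not None and args.end is None:
        args.end = args.start + timedelta(days=7)
    return args


def in_range(post_date, start, end):
    """Inclusive range check; with no start date every post qualifies."""
    return start is None or start <= post_date <= end
```

With -s 26-01-2013 and no -e, the end date works out as 02-02-2013, and posts made on either of those two days are included.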

The thread id is the one that appears in the URL of the form http://www.boards.ie/vbulletin/showthread.php?t=2056814041 where the 't' parameter is the thread id.
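If you'd rather paste a whole URL than pick the number out by hand, something like this would pull the id out. It's a sketch only, assuming Python 3's urllib.parse; thread_id_from_url is a hypothetical helper and not part of the script.

```python
from urllib.parse import urlparse, parse_qs


def thread_id_from_url(url):
    """Return the value of the 't' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("t")
    return values[0] if values else None
```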

Output is to stdout so you can pipe it or select and copy it (progress messages go to stderr). It's simple BBCode with the name of the poster, the thanks count, and the first image from the post, wrapped in a link to the post itself (which I think people were always keen on)
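For reference, one entry of that output could be built along these lines. This is a minimal sketch, assuming Python 3; format_post and its parameters are names invented for the example, and post_url is taken to be the relative link scraped from the post header.

```python
def format_post(username, thanks, post_url, img_url):
    """Build one BBCode entry: bold 'name (thanks)', then the image linking to the post."""
    return (
        "[CENTER][B]{0} ({1})[/B]\n"
        "[URL=http://www.boards.ie/vbulletin/{2}][IMG]{3}[/IMG][/URL][/CENTER]"
    ).format(username, thanks, post_url, img_url)
```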

Here's the code for anyone interested. I'm running Python 2.6.5, and the script has external dependencies on the latest version of BeautifulSoup for the scrapiness, and on the lxml parser. Other than that it should be fairly straightforward. Disclaimer: scrapers are pretty sensitive to changes in output and formatting. Any changes to the boards HTML will probably break this...

-edit- Hmm. The [code] tags in vbulletin don't display the embed properly. If you 'quote' this post and select the code you'll get it all. -edit-

[code]
import httplib
import urllib
import re
import sys, getopt
import operator
from bs4 import BeautifulSoup 
from datetime import date
from datetime import datetime
from datetime import timedelta

class post:
    id = 0
    postdate = date.today()
    url = ""
    count = 0
    username = ""    
    img = ""
    thanks = 0

def templatePosts(posts,num):
    #emit at most num posts as BBCode; slicing avoids printing num+2 posts, which breaking on i > num after the print did
    for p in posts[:num]:
        print "[CENTER][B]"+p.username+" ("+str(p.thanks)+")[/B]\n[URL=http://www.boards.ie/vbulletin/"+p.url+"][IMG]"+p.img+"[/IMG][/URL][/CENTER]\n\n"
        
    
def parseSinglePage(soupyPage, start,end, posts, skipFirst = False):

    idRegex = re.compile(r"post([0-9]+)")
    thanksRegex = re.compile(r"\(([0-9]+)\) thanks from:")
    dateTimeRegex = re.compile(r"(Yesterday|Today|[0-9,-]+),(.)")
    all_tables = soupyPage.find_all("table",id=idRegex)

    for (i,table) in enumerate(all_tables):

        if i == 0 and skipFirst:
            continue

        aPost = post()

        m = idRegex.match(table.get('id'))
        aPost.id = m.group(1)

        headers = table.find_all("td","thead")
        #should be two td elements with class thead. First one will contain the date, second one will have the post count & url
        dateString = str(headers[0].contents[2].string).strip(' \n\r\t')
        #str() never returns None, so test the regex match itself; an unrecognised date leaves the default postdate in place
        dateMatch = dateTimeRegex.match(dateString)
        if dateMatch != None:
            dateString = dateMatch.group(1)
            if dateString == "Today":
                aPost.postdate = date.today()
            elif dateString == "Yesterday":
                oneday = timedelta(days=1)
                aPost.postdate = date.today() - oneday
            else:
                aPost.postdate = datetime.strptime(dateString,"%d-%m-%Y").date()
                
        aPost.url  = str(headers[1].find("a").get("href"))
        aPost.count = str(headers[1].find("strong").string)
        
        #search into the table element and find the div w/ id = postmenu_<postId> then the 'a' tag underneath that 
        #which will give us the username of the poster.
        username = table.find("div",id="postmenu_"+aPost.id).find("a").string
        aPost.username = username

        #now search into the table element again to find the div with the same id and class postcontent
        #then search into IT to get the img tag if it's there and grab the src from it
        #this will only find the first image in the post. Meh.
        postcontent = table.find("div",id="post_message_"+aPost.id,class_="postcontent")
        imgTag = postcontent.find("img")
        if imgTag != None:
            aPost.img = str(imgTag.get("src"))
        else:
            aPost.img = None
        
        
        #print "Post. id: "+aPost.id+" Count: "+aPost.count+" User: "+aPost.username
        #use the ID above to lazily get the DIV that has the thanks box in it, then css select into that and get the strong element to
        #get the actual text of the thanks string, then extract the thanks count from the string with the regex
        thanksCount = 0    
        thanksString = soupyPage.find("div",id="post_thanks_box_"+aPost.id).select(".alt2 > strong")
        if thanksString:
            thanksMatch = thanksRegex.match(thanksString[0].string)
            if thanksMatch != None:
                thanksCount = thanksMatch.group(1)
            else:
                thanksCount = "1"

        aPost.thanks = int(thanksCount)

        #we only want to add the post if EITHER the start isn't set (we use the entire thread)
        #or the start < post date and the end > post date 
        #AND the thanks > 0
        #AND there's an image actually in the post.
        if (start is None or (start <= aPost.postdate and end >= aPost.postdate)) and aPost.thanks > 0 and aPost.img != None:
            posts.append(aPost)

def handleOneThread(conn, threadId, start, end, posts):
    
    conn.request("GET","/vbulletin/showthread.php?t="+threadId)
    response = conn.getresponse()
    print >> sys.stderr,"Thread id: "+threadId+" "+str(response.status) +" "+str(response.reason)
    if response.status != 200 :
        print >> sys.stderr,"Unable to load thread from id "+threadId+""
        sys.exit(1)

    allPage = response.read()
    soup = BeautifulSoup(allPage,"lxml")

    pageRegex = re.compile("Page 1 of ([0-9]+)")
    pages = soup.find("table",class_="pagination_wrapper")
    pagesMatch = 1
    if pages:
        pagesString = str(pages.find("td", class_ = "vbmenu_control").string).strip(' \n\r\t')
        pagesMatch = int(pageRegex.match(pagesString).group(1))

    print >> sys.stderr,"Page 1 of "+str(pagesMatch)
    parseSinglePage(soup, start, end, posts, True)    
    for pageNumber in range(2,pagesMatch+1):
        conn.request("GET","/vbulletin/showthread.php?t="+threadId+"&page="+str(pageNumber))
        response = conn.getresponse()
        allPage = response.read()
        print >> sys.stderr,"Retrieved page "+str(pageNumber)+" of "+str(pagesMatch) +" " + str(response.status) +" "+str(response.reason)
        soup = BeautifulSoup(allPage,"lxml")
        parseSinglePage(soup, start, end, posts)

def main(argv):
    #get the initial connection, GET the first page and parse for page numbers.
    
    try:
        opts, args = getopt.getopt(argv,"t:s:e:n:",["thread=","start=","end=","num="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)

    start = None 
    end = None
    num = 10
    threads = []
    
    for opt, arg in opts:
        if opt in ("-t","--thread"):
            threads.append(arg)
        elif opt in("-s","--start"):
            try:
                start = datetime.strptime(arg,"%d-%m-%Y").date()
            except ValueError:
                print >> sys.stderr, "Start date '"+arg+"' must be of the form dd-mm-yyyy, e.g. 15-04-2013"
                sys.exit(2)
        elif opt in("-e","--end"):
            try:
                end = datetime.strptime(arg,"%d-%m-%Y").date()
            except ValueError:
                print >> sys.stderr, "End date '"+arg+"' must be of the form dd-mm-yyyy, e.g. 15-04-2013"
                sys.exit(2)
        elif opt in("-n","--num"):
            num = int(arg)
        
    if start != None and end is None:
        week = timedelta(days=7)
        end = start + week

    if len(threads) == 0:
        usage()
        sys.exit(2)

    if start != None:    
        print >> sys.stderr, "Timerange for posts is from "+str(start)+" to "+str(end)


    conn = httplib.HTTPConnection('www.boards.ie',80)
    conn.connect()

    posts = []
    
    for threadId in threads:
        handleOneThread(conn, threadId, start, end, posts)

    #how many retrieved posts in total        
    print >> sys.stderr, "Posts: "+str(len(posts) )
    
    #sort the posts by thanks
    posts = sorted(posts, key=operator.attrgetter('thanks'), reverse=True)   

    #quickly log just the top N posts to stderr
    for p in posts[:num]:
        print >> sys.stderr, "Post. id: "+str(p.id)+" Count: "+str(p.count)+" User: "+str(p.username)+" Thanks: "+str(p.thanks)+" Date: "+str(p.postdate)

    #now output the posts to stdout in some nice BBCODE format

    templatePosts(posts,num)


def usage():
    print >> sys.stderr, "Usage: -t thread-number [-t thread-number] [-s start-date(dd-mm-yyyy) [-e end-date(dd-mm-yyyy)]] [-n number]"
    print >> sys.stderr, "\nMultiple threads can be specified by specifying more than one -t option followed by the thread id."
    print >> sys.stderr, "If no start date is specified the entire thread or threads is searched for posts to compare."
    print >> sys.stderr, "If a start date is specified but no end date the end date is set to 7 days after the start date. The time ranges are INCLUSIVE"

if __name__ == "__main__":
    main(sys.argv[1:])
[/code]

Comments

#2 26-04-2013 11:51am

DaireQuinlan

Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭

Join Date: November 2006

Posts: 6467

Output from the following command:

python thankless.py -t 2056814041 -s 26-01-2013

is as follows. This is the top 10 thanked posts from the Random Photos XLI thread, from the 26th Jan to the 02nd Feb inclusive, so basically the basis of the next thread that DGK needs to create

They were all contained in the one thread this time, but multiple threads can be specified if necessary and the post list will be built up from a combination of them.
maddog (40)

Doom (38)
[IMG]https://us.v-cdn.net/6034073/uploads/attachments/330238/238110.JPG[/IMG]

Reati (36)

Mjollnir (30)

mikka631 (27)

.Longshanks. (27)

redape99 (27)

gerk86 (26)

dirtyghettokid (25)

Art Deko (24)

Art Deko (23)

Art Deko (23)
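The combining step mentioned above, where posts from several threads go into one list before ranking, can be sketched like this. It assumes Python 3 and posts held as plain dicts with a "thanks" key, which is a simplification of the post class in the script; top_posts is a made-up name.

```python
from operator import itemgetter


def top_posts(posts_by_thread, num=10):
    """Merge the post lists of several threads and return the num most thanked."""
    combined = [post for posts in posts_by_thread for post in posts]
    # sorted() is stable, so posts with equal thanks keep their original order
    ranked = sorted(combined, key=itemgetter("thanks"), reverse=True)
    return ranked[:num]
```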

#3 26-04-2013 11:53am

Kenny Logins

Registered Users, Registered Users 2 Posts: 9,060 ✭✭✭

Join Date: September 2010

Posts: 8918

Clever.

#4 26-04-2013 11:53am

syklops

Closed Accounts Posts: 18,966 ✭✭✭✭

Join Date: September 2004

Posts: 18753

That's an impressive bit of work for someone who only started learning Python yesterday. Surely you have some previous programming experience?

#5 26-04-2013 11:55am

DaireQuinlan

Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭

Join Date: November 2006

Posts: 6467

syklops wrote: »

That's an impressive bit of work for someone who only started learning Python yesterday. Surely you have some previous programming experience?

(OP may have contained some facetiousness)

#6 26-04-2013 12:02pm

syklops

Closed Accounts Posts: 18,966 ✭✭✭✭

Join Date: September 2004

Posts: 18753

DaireQuinlan wrote: »

(OP may have contained some facetiousness)

I'm glad to hear that. I had a real feeling of inadequacy there for a few minutes.

#7 26-04-2013 12:26pm

dirtyghettokid

Registered Users, Registered Users 2 Posts: 2,988 ✭✭✭

Join Date: June 2010

Posts: 2841

that's cool! the only issue is that some users would win POTW and HM, which if they win POTW i tend to not give HM, and give it to next in line. not sure if this is "fair" in others eyes, but it's just the way i've been doing it. i'll have another look later.. up to my eyes at the min. but great idea!! good work :cool:

#8 26-04-2013 12:41pm

DaireQuinlan

Registered Users, Registered Users 2 Posts: 6,713 ✭✭✭

Join Date: November 2006

Posts: 6467

dirtyghettokid wrote: »

that's cool! the only issue is that some users would win POTW and HM, which if they win POTW i tend to not give HM, and give it to next in line. not sure if this is "fair" in others eyes, but it's just the way i've been doing it. i'll have another look later.. up to my eyes at the min. but great idea!! good work :cool:

Yeah I figured. You can output any number of posts with the [-n num] command line option, it defaults to 10 so you can weed out POTW winners and I assume duplicates manually. Or feel free to add it to the code
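If anyone does want to add it, the weeding could look something like this. It's a rough sketch, assuming Python 3 and posts held as dicts with a "username" key, already sorted by thanks; one_per_user and excluded_users are names made up for the example, with excluded_users standing in for the POTW winners.

```python
def one_per_user(ranked_posts, excluded_users=()):
    """Keep only each user's best post, skipping anyone in excluded_users.

    ranked_posts must already be sorted by thanks, highest first.
    """
    seen = set(excluded_users)
    kept = []
    for post in ranked_posts:
        if post["username"] in seen:
            continue  # already placed, or excluded up front
        seen.add(post["username"])
        kept.append(post)
    return kept
```

Run over the full sorted list before cutting it down to the top N, this would collapse the three Art Deko entries in the output above into the one with 24 thanks.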

#9 26-04-2013 3:45pm

5uspect

Moderators, Arts Moderators Posts: 10,520 Mod ✭✭✭✭

Join Date: May 2006

Posts: 10279

I've only recently started learning Python myself, this is seriously impressive.
