
pick out tagged text from file

  • 11-08-2004 3:31pm
    #1
    Moderators, Arts Moderators Posts: 35,788 Mod ✭✭✭✭


    Hi,

    I'm looking for something (probably an awk command) to help me pull only a certain type of tagged text from a file. The open tag is
    <:cs "BoldOption"1>
    
    and the close tag is
    <:/cs>

    To further complicate things, there are other tags which open differently but which close with <:/cs>.

    Anyone actually good at awk (I always promise I'm going to learn it but it's been almost 10 years now...)? :)


Comments

  • Registered Users, Registered Users 2 Posts: 68,317 ✭✭✭✭seamus


    I don't know awk, but something similar to:
    /<:cs "BoldOption"1>(.*)<:/cs>/
    
    Might need to escape a few characters in there too.


  • Registered Users, Registered Users 2 Posts: 1,865 ✭✭✭Syth


    There are some problems with that, seamus. Firstly, I don't think . matches newlines, so that will only work if both <:cs "BoldOption"1> and <:/cs> are on the same line; if they aren't, it won't match anything. Secondly, as pickarooney pointed out, there are other tags that close with <:/cs>, and .* will try to match the longest string possible, so if you had something like:
    <:cs "BoldOption"1>textetext<:/cs>stuffstuffstuff<:cs "SomeOption"2>blahblahblah<:/cs>
    
    then your regex would set \1 (or $1 if you're using Perl) to be "textetext<:/cs>stuffstuffstuff<:cs "SomeOption"2>blahblahblah", which is clearly wrong.

    Now, assuming you're using some kind of bastardised XML-like markup to store your data, I'm assuming that tags start with < and that any < starts a tag (can you confirm that?). If that's so, then by having
    /<:cs "BoldOption"1>([^<]*)<:/cs>/
    
    , you would prevent it going to the last occurrence of "<:/cs>" in the line. However, if your <:cs ...> tag can have child elements then that regex won't work, as it won't let you include the opening < of the child elements.

    A way to take out all the text, even if the opening and closing tags are on different lines and there are child elements, would be this li'l Perl script:
    #! /usr/bin/perl -w
    $tagLevel = 0;
    @lines = <>;
    foreach (@lines) {
        while ($_ ne "") {
            if (/<:cs "BoldOption"1>(.*)$/ {
                $tagLevel++;
                $_ = $1;
            if (/^(.*)<:/cs>(.*)$/) {
                $tagLevel--;
                if ($tagLevel == 0) {
                    $_ = $1;
                    print;
                } else {
                    $_ = $1  . "<:/cs>" . $2;
                }
            }
            if (!($_ =~ /<:cs "BoldOption"1>/ or $_ =~ /^(.*)<:/cs>(.*)$/)) {
                $_ = "";
            }
            if ($tagLevel > 0) {
                print;
           }
        }
    }
    
    OK, I think that should work. I just sat down and wrote that without trying it out. I had similar problems doing some basic XML work with Perl. If it doesn't work, tell me the output/errors/warnings. Either way, that was fun; keep up the interesting problems.


  • Moderators, Arts Moderators Posts: 35,788 Mod ✭✭✭✭pickarooney


    Thanks for the enthusiastic replies :)

    I've no experience with perl, but there seems to be an unescaped special character in lines 9 and 18 (the /c) according to the error message I get when I run it:
    Use of /c modifier is meaningless without /g at ./outtags.pl line 9.
    Scalar found where operator expected at ./outtags.pl line 9, near "*)$/" (Missing operator before $/?
    

    Also some syntax errors at lines 6, 9, 18 and 24.

    I'd like to test the regexp above. What command should I wrap it in?
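
Since the original request was for awk, one way to wrap the [^<]* regexp is with awk's match() function. This is a rough sketch rather than a tested production tool: extract_bold_awk is a hypothetical helper name, it reads standard input, and it assumes the opening and closing tags sit on the same line with no child tags between them.

```shell
#!/bin/sh
# extract_bold_awk is a made-up name; feed it the file on standard input.
extract_bold_awk() {
    awk '
    BEGIN {
        olen = length("<:cs \"BoldOption\"1>")  # length of the opening tag
        clen = length("<:/cs>")                 # length of the closing tag
    }
    {
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost match in rest.
        while (match(rest, /<:cs "BoldOption"1>[^<]*<:\/cs>/)) {
            # Print only the text between the two tags.
            print substr(rest, RSTART + olen, RLENGTH - olen - clen)
            # Carry on after this match to catch further tags on the line.
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

# Made-up sample line with two BoldOption tags and one other tag.
printf '%s\n' '<:cs "BoldOption"1>one<:/cs> x <:cs "Other"2>no<:/cs> <:cs "BoldOption"1>two<:/cs>' |
    extract_bold_awk
```

On that sample it prints one and two on separate lines and skips the text inside the other tag. For a real file the call would be something like extract_bold_awk < yourfile, where yourfile is a placeholder.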


  • Registered Users, Registered Users 2 Posts: 1,865 ✭✭✭Syth


    Right, well, I fixed those errors in the original code; it was just simple syntax stuff. However, once I'd done that, I tried to run the code. Yay! Infinite loop! So I scrapped it and am trying to write it from scratch. Following the FLOSS (free, libre, open source software) method of software production, I'm going to share the code as it stands, in the hope that others can help write this and produce good code.
    #! /usr/bin/perl -w
    # Simple Extraction stuff. Released under the GNU GPL.
    $tagLevel = 0;
    $numLoops = 0;
    foreach (<>) {
    	chomp;
    	# $_ first holds the whole line, but gets trimmed down to everything after a tag.
    	# When we've looked at the whole of this line, then move on to the next line.
    	while ($_ ne "") {
    		# See if this line contains an opening tag, if so process it.
    		# The [^>...] class takes the text up to the > that ends the tag.
    		# The / in the class is to make sure we don't get a closing tag.
    		if (/<([^>/]+)>(.*)$/) {
    			# $remainingText is the text after the tag.
    			$tagText = $1; $remainingText = $2;
    			# See if this tag is the right type of tag.
    			if ($tagText =~ /:cs "BoldOption"1/) {
    				$tagLevel++;
    			}
    			# We need to look at the rest of the line, so we set $_ to be $remainingText
    			$_ = $remainingText;
    		}
    		# Now check for closing tags.
    		if (/<\/([^>]+)>(.*)$/) {
    			$tagText = $1; $remainingText = $2;
    			if ($tagText =~ /:cs/) {
    				$tagLevel--;
    				print if $tagLevel == 0;
    			}
    			$_ = $remainingText;
    		}
    		print if $tagLevel == 0;
    		
    		# This is to prevent us going into an infinite loop. It helps debugging.
    		$numLoops++;
    		last if $numLoops > 10;
    	}
    }
    
    Y'know this would be easy if you were using XML, because there are perl modules that could do this in a few lines.


  • Moderators, Arts Moderators Posts: 35,788 Mod ✭✭✭✭pickarooney


    Yep, if it were XML I'd just have to run the existing tool over it ;)
    Still getting errors on line 13 with this script, sorry.

