
Defunct files/cleaning out dead wood

  • 19-03-2008 1:17pm #1
    Registered Users Posts: 68,317 ✭✭✭✭seamus


    Right, maybe more a web question than a programming one, but I have a problem whereby the web application I've inherited has what seems like a few hundred completely unused files.

    Lots of them are obviously previous incarnations of a page (e.g. includes1.php, includes2.php, includes_old.php) but none of them have any documentation and there's a good chance that they *may* be referenced by one file or another.

    What I'm really looking for is a way to remove the unreferenced files - probably something along the lines of a parser which takes a list of strings (i.e. file names), searches every file in a particular directory and then spits back the strings which weren't found in any file.
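
    I'm imagining something like this rough PHP sketch (the names and path here are just placeholders, and it only looks at one flat directory):

    <?php
    // Minimal sketch: report which of $names never appear in any file in $dir.
    $names = array('includes1.php', 'includes2.php', 'includes_old.php');
    $dir   = '/var/www/app'; // placeholder path

    $found = array_fill_keys($names, false);
    foreach (glob($dir . '/*') as $file) {
        if (!is_file($file)) continue;
        $contents = file_get_contents($file);
        foreach ($names as $name) {
            if (basename($file) === $name) continue; // don't count a file referencing itself
            if (!$found[$name] && strpos($contents, $name) !== false) {
                $found[$name] = true;
            }
        }
    }
    foreach ($found as $name => $hit) {
        if (!$hit) echo $name . " appears to be unreferenced\n";
    }
    ?>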

    I'm doing this mainly to make my life easier. A, I'm in the middle of rebuilding the app's backend and documenting everything, and B, when I have to troubleshoot, it adds a few extra minutes trying to figure out whether the script is calling "includes1.php" or "includes2.php" and whether either of these include files actually contains the offending piece of code.

    Any ideas?


Comments

  • Registered Users Posts: 2,494 ✭✭✭kayos


    Windows Key + F

    Sometimes the easiest ways are the best :)


  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    kayos wrote: »
    Windows Key + F

    Sometimes the easiest ways are the best :)
    Yes, but can it be scripted? :p

    I have a tonne of these files, so copying and pasting in the name of each file would be a PITA :(


  • Registered Users Posts: 7,468 ✭✭✭Evil Phil


    I don't know how you're going to implement it exactly, but you should develop it as an application and publish it. Lots of people would use it.


  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    :)
    I've been looking for a reason to do something in Visual Studio 2005. Though I suppose it would be good to have it for any platform. Perl maybe?


  • Registered Users Posts: 7,468 ✭✭✭Evil Phil


    Perl would be good, 'specially for all that string processing/regexp stuff.


  • Registered Users Posts: 2,494 ✭✭✭kayos


    I got bored. Not exactly what you want, but meh, something you can build on if you fancy doing it in .NET.

    Works more along the lines of taking multiple string search expressions and a directory to search, then pumps out all the matches into a treeview grouped by Search Expression - File - Line.

    So

    Search Exp1
      - File A
        - Line X - Line text
        - Line Y - Line text
        - Line Z - Line text
      - File B
        - Line Z - Line text

    Search Exp2
      - File A
        - Line Z - Line text
      - File B
        - Line Y - Line text
        - Line Z - Line text

    Feel free to mock, improve, question or rofl!
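
    For anyone who'd rather stay in PHP than .NET, a rough sketch of the same expression - file - line grouping might look like this (the patterns and path are placeholders):

    <?php
    // Rough sketch: group matches as expression -> file -> line, like the treeview.
    $expressions = array('/includes1\.php/', '/includes2\.php/'); // placeholder patterns
    $dir = '/var/www/app';                                        // placeholder path

    $results = array();
    foreach ($expressions as $exp) {
        foreach (glob($dir . '/*') as $file) {
            if (!is_file($file)) continue;
            foreach (file($file) as $num => $line) {
                if (preg_match($exp, $line)) {
                    $results[$exp][$file][$num + 1] = rtrim($line);
                }
            }
        }
    }

    // Print the same tree shape: expression, then file, then line number and text.
    foreach ($results as $exp => $files) {
        echo $exp . "\n";
        foreach ($files as $file => $lines) {
            echo "  - " . basename($file) . "\n";
            foreach ($lines as $num => $text) {
                echo "    - Line " . $num . " - " . $text . "\n";
            }
        }
    }
    ?>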


  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    Damn, missed that kayos. I'll take a look at it in work tomorrow.

    I wrote a script in PHP this afternoon to do this, or something like it. Takes forever (as someone in work pointed out to me, the complexity is O(n^2)) so I've left it running overnight to process 600-odd files. I'll post the script with comments on the performance tomorrow.

    Basically aggregates a list of all files in a directory and its sub-directories and goes through each one-by-one to determine if any of the other files link to it...
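
    The shape of it is roughly this (a sketch of the approach, not the actual script; the path is a placeholder):

    <?php
    // Sketch: list every file under $dir recursively, then check each file's
    // name against the contents of all the others - n files x n files = O(n^2).
    function listFiles($dir) {
        $files = array();
        foreach (scandir($dir) as $entry) {
            if ($entry === '.' || $entry === '..') continue;
            $path = $dir . '/' . $entry;
            if (is_dir($path)) {
                $files = array_merge($files, listFiles($path));
            } else {
                $files[] = $path;
            }
        }
        return $files;
    }

    $files = listFiles('/var/www/app'); // placeholder path
    foreach ($files as $candidate) {
        $name = basename($candidate);
        $referenced = false;
        foreach ($files as $other) {
            if ($other === $candidate) continue;
            if (strpos(file_get_contents($other), $name) !== false) {
                $referenced = true;
                break;
            }
        }
        if (!$referenced) echo $name . " looks unreferenced\n";
    }
    ?>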


  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    OK, turns out that my script failed overnight. A couple of things I overlooked in my 45 minutes of hacking:
    1. I assumed that all files were small.
    2. I assumed that PHP could load in the entire contents of any file.

    There was a 0.5 GB file in the directory which made it fall over when the script attempted to read it in.

    So I added a variable for the maximum filesize to open, and now it runs. I also tweaked the way I read in files: instead of reading in an entire file and then doing a preg_match, it reads the files line-by-line and performs a preg_match on each line. This is faster. I don't know why, but I'm guessing it has something to do with having a smaller memory requirement (i.e. only having to store a line in memory instead of a whole file).
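
    In code, the two fixes amount to something like this (a sketch of the idea, not the attached script; the function name and $maxSize value are placeholders):

    <?php
    // Sketch of the fixes: skip oversized files, and scan line-by-line so only
    // one line is held in memory at a time instead of the whole file. Returning
    // on the first hit also means most files never get read to the end.
    $maxSize = 5 * 1024 * 1024; // placeholder cap: 5 MB

    function fileReferences($path, $pattern, $maxSize) {
        if (filesize($path) > $maxSize) {
            return false; // too big to scan - the 0.5 GB file case
        }
        $handle = fopen($path, 'r');
        while (($line = fgets($handle)) !== false) {
            if (preg_match($pattern, $line)) {
                fclose($handle);
                return true; // stop at the first match
            }
        }
        fclose($handle);
        return false;
    }

    // Usage: fileReferences('/var/www/app/index.php', '/includes1\.php/', $maxSize);
    ?>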

    Scanned through my directory of 600+ files in about 30 minutes and gave me a list of 280 files that were no longer needed. :eek:

    It's still a bit raw and there are a couple of enhancements that could be added to speed it up and make it more accurate:
    1. Tell it to only scan certain file types - .php, .htm, .asp, etc. (something like the check sketched after this list). It currently reads through everything - images, log files and all. I noticed that log files tend to mention scripts/pages which are no longer in use, thus giving false negatives.
    2. Tweak the regexps. Regular expressions aren't my forte, so I did the best with what I know.
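
    For number 1, the filter could be as simple as an extension whitelist checked before each scan (the helper name and list here are placeholders):

    <?php
    // Sketch of enhancement 1: only scan files whose extension is whitelisted.
    function shouldScan($path, $extensions) {
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        return in_array($ext, $extensions);
    }

    $scanExtensions = array('php', 'htm', 'html', 'asp'); // placeholder list
    // Usage: if (shouldScan($file, $scanExtensions)) { ...scan it... }
    ?>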

    Let me know what ye think. I've hopefully commented it enough that it's obvious how to change and run it. I've set it up to be run from the command line, not as a web page, purely because of the length of time it takes - most if not all servers would time out.

    (vBulletin doesn't like me uploading scripts with .php extensions :D)

