Traversing all files under a directory with java.io.File

Today I had this requirement.

I have a large collection of HTML files on my local SSD. I downloaded a complete site with wget, using the command below:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://www.yoursite.org

This downloads the full website into a folder (named after the host) under your current directory, including scripts, CSS files, graphics and videos. Then I needed to access all the HTML files to clean them. I wanted to use Java, so I had to write a recursive function. After some attempts I got the following code. After the weekend I hope to be finished with the project and put the results online.

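// Needs: import java.io.File;
// process(File) is your own handler for each entry.

// Process all files and directories under dir
public static void visitAllDirsAndFiles(File dir) {
    process(dir);

    if (dir.isDirectory()) {
        // list() returns null if the directory cannot be read
        String[] children = dir.list();
        if (children == null) {
            return;
        }
        for (String child : children) {
            // Recursive call
            visitAllDirsAndFiles(new File(dir, child));
        }
    }
}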

// Process only directories under dir
public static void visitAllDirs(File dir) {
    if (dir.isDirectory()) {
        process(dir);

        // list() returns null if the directory cannot be read
        String[] children = dir.list();
        if (children == null) {
            return;
        }
        for (String child : children) {
            visitAllDirs(new File(dir, child));
        }
    }
}

// Process only files under dir
public static void visitAllFiles(File dir) {
    if (dir.isDirectory()) {
        // list() returns null if the directory cannot be read
        String[] children = dir.list();
        if (children == null) {
            return;
        }
        for (String child : children) {
            visitAllFiles(new File(dir, child));
        }
    } else {
        process(dir);
    }
}
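
The methods above leave process(File) open. As a rough, self-contained sketch of the files-only variant, assuming the goal is to pick out just the downloaded HTML pages, something like the class below could work. The names HtmlFileCollector, collectHtmlFiles and isHtml are made up for this illustration, and matching on the .html/.htm extension is an assumption about how the pages are named.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: gathers the HTML files under a directory tree.
class HtmlFileCollector {

    // Returns every regular file under root whose name ends in .html or .htm.
    static List<File> collectHtmlFiles(File root) {
        List<File> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    // Same recursion as visitAllFiles, but it collects matches instead of calling process().
    private static void collect(File entry, List<File> found) {
        if (entry.isDirectory()) {
            File[] children = entry.listFiles(); // null if the directory cannot be read
            if (children == null) {
                return;
            }
            for (File child : children) {
                collect(child, found);
            }
        } else if (entry.isFile() && isHtml(entry)) {
            found.add(entry);
        }
    }

    // Assumption: HTML pages are recognised by their file extension only.
    static boolean isHtml(File file) {
        String name = file.getName().toLowerCase(Locale.ROOT);
        return name.endsWith(".html") || name.endsWith(".htm");
    }

    public static void main(String[] args) {
        // Start directory: first argument, or the current directory if none is given.
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File html : collectHtmlFiles(root)) {
            System.out.println(html.getPath());
        }
    }
}
```

Run it with the mirror folder as the argument to list the pages that would be handed to the cleaning step. Returning a list instead of processing inside the recursion keeps the traversal and the cleaning separate, which makes each easier to test.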

In my case I only need some parts of the HTML code, which I want to transfer to MySQL. (The project I'm working on is a web scraping project.)
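
To give an idea of what pulling one part out of a page could look like, here is a minimal sketch that extracts the text of the title element from an HTML string. Choosing the title is purely an assumption for illustration, the TitleExtractor class and extractTitle method are hypothetical names, and a regular expression like this only copes with simple markup. The MySQL step is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the <title> text out of an HTML string.
class TitleExtractor {

    // Matches <title ...>text</title>, ignoring case and allowing line breaks in the text.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional if no title element is found.
    static Optional<String> extractTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        if (matcher.find()) {
            return Optional.of(matcher.group(1).trim());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Example page </title></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title)"));
    }
}
```

For anything beyond simple cases, a real HTML parser is likely the safer choice than a regular expression.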

I hope to be able to show you more details after the weekend. So stay tuned!
