Today I had this requirement.
I have a large collection of HTML files on my local SSD. I had downloaded a complete site with wget; to do this, I used the command below:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://www.yoursite.org
This downloads the full website to your current directory, including scripts, CSS files, graphics and videos. Next, I needed to access all the HTML files to clean them. I wanted to use Java, so I had to write a recursive function. After some attempts I ended up with the following code.
import java.io.File;

// Process all files and directories under dir
public static void visitAllDirsAndFiles(File dir) {
    process(dir);
    if (dir.isDirectory()) {
        String[] children = dir.list();
        if (children != null) { // list() returns null if an I/O error occurs
            for (String child : children) {
                // Recursive call:
                visitAllDirsAndFiles(new File(dir, child));
            }
        }
    }
}

// Process only directories under dir
public static void visitAllDirs(File dir) {
    if (dir.isDirectory()) {
        process(dir);
        String[] children = dir.list();
        if (children != null) {
            for (String child : children) {
                visitAllDirs(new File(dir, child));
            }
        }
    }
}

// Process only files under dir
public static void visitAllFiles(File dir) {
    if (dir.isDirectory()) {
        String[] children = dir.list();
        if (children != null) {
            for (String child : children) {
                visitAllFiles(new File(dir, child));
            }
        }
    } else {
        process(dir);
    }
}
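To use this, the process method still has to be supplied; in my case it only needs to act on the HTML files. A minimal sketch of what that could look like (the body of process here is just an illustration, since the real cleaning logic is what I'm still working on):

// Hypothetical process implementation: only handle HTML files.
// wget's --adjust-extension ensures HTML pages end in .html.
public static void process(File file) {
    if (file.isFile() && file.getName().toLowerCase().endsWith(".html")) {
        System.out.println("Cleaning: " + file.getAbsolutePath());
        // ... actual cleaning logic goes here ...
    }
}

public static void main(String[] args) {
    // Start at the directory that wget --mirror created
    visitAllFiles(new File("www.yoursite.org"));
}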
In my case I only need certain parts of the HTML, which I want to transfer to MySQL. (The project I'm working on is a web-scraping project.)
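To give an idea of the direction, here is a rough sketch of how such a transfer could look, using jsoup for parsing and plain JDBC for MySQL. The pages table, the div.content selector and the connection settings are placeholder assumptions for this example; the real selectors depend on the site.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageImporter {

    // Parse one HTML file and insert the parts we care about into MySQL
    public static void storePage(File htmlFile, Connection conn) throws Exception {
        Document doc = Jsoup.parse(htmlFile, "UTF-8");
        String title = doc.title();
        String body = doc.select("div.content").text(); // hypothetical selector

        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO pages (title, body) VALUES (?, ?)");
        ps.setString(1, title);
        ps.setString(2, body);
        ps.executeUpdate();
        ps.close();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder connection details
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/scraper", "user", "password");
        storePage(new File("www.yoursite.org/index.html"), conn);
        conn.close();
    }
}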
After the weekend I hope to be finished with the project and to put the results online, and I'll be able to show you more details then. So stay tuned!