A1 website download and httrack

7/26/2023

A Discourse forum that I use is being taken offline in a couple of weeks, so I set out to archive the site. I did a lot of research and trial and error, and I found a simple solution with HTTrack. Here’s everything I learned.

For Windows users, the best solution appears to be HTTrack. This worked great and it archived the site to HTML files. All categories, threads, and posts were archived, including all pages with relative navigation links. A basic tutorial on HTTrack is here. I left the settings on default with the following custom settings.
Mozilla/5.0 (compatible Googlebot/2.1 +).

Note: There’s a CSS issue preventing category links from working; however, that can easily be fixed as described below. To fix the links, simply block/delete the following CSS file:

I ran into the following challenges and eventually overcame them after much trial and error.
Discourse pages are dynamically generated with JavaScript.
This makes for poor results with most archive/crawler tools.
Most threads only load with about the first 20 posts; the rest of the posts don’t appear until you scroll down.
Pressing Ctrl+P loads a /print page with all posts visible.
Users are limited to printing 5 pages an hour with print mode, but this limit can be increased by a Discourse site admin.
Adrelanos noted that multi-page threads weren’t being archived properly by HTTrack; however, I suspect this issue was due to his HTTrack settings, as I did not have this issue.
Saving a page to PDF won’t include any collapsed details sections.
Pages can be loaded in basic HTML by adding ?_escaped_fragment_ to the end of a URL, but this trick only works for threads, not categories.

The above challenges aren’t a concern once you learn that all Discourse pages/content can be rendered properly as HTML for crawlers. To do this, you must change your crawler’s or browser’s user agent to googlebot to get the HTML version of pages. If you use the “Save Page Now” feature, it will archive the JavaScript version of Discourse with poor results, because it uses the user agent of the person requesting the archive. So you must change your user agent to googlebot. You can get a Chrome extension called “User-Agent Switcher for Chrome”. String: Mozilla/5.0 (compatible Googlebot/2.1 +).

Many tools are listed here: Archive an old forum “in place” to start a new Discourse forum. Command line tools include mcmcclur’s tool and wget. I also briefly tested GUI tools like Cyotek WebCopy, A1 Website Download, and WAIL. However, for Windows users, the best solution appears to be HTTrack.

Note: Since I’m a new user, I’m limited to two links in a post. Hence I turned some links into preformatted text.

When viewing category pages as googlebot, the thread links don’t work. This makes navigation impossible on category pages in HTTrack and in Google cache. This appears to be a Discourse issue in a CSS file. It turned out the background image is conflicting with the links! Move the background image down a layer to fix the links.
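
For anyone scripting this instead of using a GUI, the user-agent and URL tricks from this post can be sketched in a few lines of Python. This is only a rough illustration, not something I ran against the forum: the full user-agent string is Google's published desktop Googlebot string (I am assuming that is the one meant above), the helper names and the forum.example.com address are made up, and the assumption that the print page lives at the thread URL plus /print is mine.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

# Google's published desktop Googlebot string (assumed to be the intended one).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def crawler_request(url):
    """Build (but do not send) a request that identifies itself as Googlebot."""
    return Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def escaped_fragment_url(thread_url):
    """Append _escaped_fragment_ to the query string of a thread URL."""
    parts = urlsplit(thread_url)
    query = parts.query + "&_escaped_fragment_" if parts.query else "_escaped_fragment_"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def print_url(thread_url):
    """Return the /print variant of a thread URL (assumed URL shape)."""
    parts = urlsplit(thread_url)
    path = parts.path.rstrip("/") + "/print"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    thread = "https://forum.example.com/t/some-topic/123"  # hypothetical thread
    print(escaped_fragment_url(thread))
    print(print_url(thread))
```

Passing the result of crawler_request(...) to urllib.request.urlopen should then fetch the crawler (plain HTML) version of a page, assuming the forum serves it to that user agent as described above.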
