
On the practice of "content preservation", I have a few things to say

What follows is fairly blunt, but getting things done really does require efficiency.

Manually copying and pasting articles one by one is simply inefficient. If the goal is just to preserve them (set aside curation and the like for now), you should look for a more efficient method rather than settling for "at least I did something".

To put it bluntly, for a site with a large number of articles (FB or news sites, say, rather than scattered personal pages), instead of clipping pages by hand, you are better off spending that time on paid work and using the money to hire even a third-rate engineer to use or write a simple tool that saves pages in batches.

What options are there to consider?

Submit URLs to sites like archive.org

  • Method: submit URLs manually, submit them with one click via a browser extension, or submit them in batches with command-line tools such as archivenow (see the sketch after this list)
  • Completeness: almost identical to the original page
  • Advantages: works with almost any website; no need to write code or maintain a server
  • Disadvantages: submissions are rate-limited, and both stability and accessibility depend on a single site (which is easy to block)
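As a rough illustration of the batch-submission idea, here is a minimal Python sketch that pushes a list of URLs to the Wayback Machine's Save Page Now endpoint (https://web.archive.org/save/<url>), the kind of endpoint that tools like archivenow wrap. The URL list and the 10-second pause between submissions are assumptions for illustration, not official figures.

```python
"""Minimal sketch: batch-submit URLs to the Wayback Machine's
Save Page Now endpoint. URLs and delay are placeholders."""
import time

import requests


def submit_to_wayback(url: str) -> bool:
    """Ask the Wayback Machine to capture a single URL."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    return resp.ok


urls = [
    "https://example.com/article/1",
    "https://example.com/article/2",
]

for url in urls:
    ok = submit_to_wayback(url)
    print("saved" if ok else "failed", url)
    time.sleep(10)  # assumed pause to stay under the submission rate limit
```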

Saving WARC archives yourself

  • Method: use tools such as archivenow to generate the archives (a sketch using the warcio library follows this list)
  • Completeness: almost identical to the original page
  • Advantages: works with almost any website; no need to write code; no reliance on a single site; the files can be stored in a decentralized way via IPFS
  • Disadvantages: large files, slow capture speed, and a service such as IPWB has to be set up before ordinary users can browse the archives
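For the "generate your own WARC" route, here is a minimal sketch using the warcio library's capture_http helper (a different tool from archivenow, chosen only to keep the example short). The output filename and URL list are placeholders.

```python
"""Minimal sketch: record fetched pages into a local WARC file
using warcio's capture_http context manager."""
from warcio.capture_http import capture_http

import requests  # imported after capture_http so its traffic gets recorded

urls = [
    "https://example.com/article/1",
    "https://example.com/article/2",
]

# Every request made inside this block is written to backup.warc.gz,
# including the full HTTP request/response records.
with capture_http("backup.warc.gz"):
    for url in urls:
        requests.get(url, timeout=30)
```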

Custom crawler that generates HTML

  • Method: write your own crawler and decide what content to capture (plain text only, text plus images, ...); see the extraction sketch after this list
  • Completeness: depends on how the crawler is implemented
  • Advantages: faster crawling, smaller files, the files can be stored in a decentralized way via IPFS, and ordinary users can open them directly
  • Disadvantages: lower page fidelity, plus the cost of writing the crawler (which is hard to reuse across websites)
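The "decide what to capture" step boils down to parsing each page and keeping only the parts you care about. Below is a minimal sketch that keeps just the title and body text; the CSS selectors are hypothetical and a real crawler has to match the target site's actual markup.

```python
"""Minimal sketch: fetch one page and extract title + plain body text.
Selectors are assumptions, not the markup of any real site."""
import requests
from bs4 import BeautifulSoup


def extract_article(url: str) -> dict:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.select_one("h1")                   # assumed: title lives in <h1>
    body = soup.select_one("div.article-content")   # assumed body container class
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "text": body.get_text("\n", strip=True) if body else "",
    }


print(extract_article("https://example.com/article/1"))
```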

The plan I chose for this backup

Taking this backup of Hong Kong Apple Daily articles as an example, the first option was ruled out right away: the total number of articles is too large (over one million), and I did not want to rely too heavily on organizations/sites such as archive.org.

In terms of user experience, WARC would be the ideal solution. However, the capture speed is low and the files take up a lot of space (roughly 5 MB for a single page, more for image-heavy articles), so it was also skipped.

The final solution was to write my own plain-text crawler. The fidelity is low, but the speed is quite fast (about 12 pages per second with 16 threads), the space requirement is modest (basically under 3 MB per day of articles), and ordinary users can easily download and browse the result.
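As a rough sketch of how such a threaded fetch loop can look, the example below runs 16 worker threads (matching the figure above) and writes the results for one day into a JSON file. The fetch_text function, the URL list, and the output filename are placeholders standing in for the site-specific parts.

```python
"""Minimal sketch: fetch many pages with 16 threads and dump the
results to one JSON file per day. All names below are placeholders."""
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch_text(url: str) -> dict:
    # Placeholder: a real crawler would parse out title/body text here.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return {"url": url, "html_length": len(resp.text)}


urls = [f"https://example.com/article/{i}" for i in range(1, 101)]  # placeholder list

results = []
with ThreadPoolExecutor(max_workers=16) as pool:  # 16 threads, as in the post
    futures = {pool.submit(fetch_text, u): u for u in urls}
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except Exception as exc:  # keep going even if a single page fails
            print("failed:", futures[fut], exc)

with open("2021-06-16.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)
```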

How long did it take in total? (2016-01-01 to 2021-06-16; the remaining years are still running)

Note: crawlers are not my specialty and I am not familiar with IPFS, so my efficiency here was not high.
  • Writing the program (first draft + fine-tuning + debugging): no more than 3 hours
  • Running the crawler: about 1 day (16 threads were not used from the start, and this includes rerun time after mid-course adjustments)
  • Setting up an IPFS node: 10 minutes
  • Adding the files to the IPFS node and pinning them: a few hours
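The add-and-pin step can be as simple as handing the crawler's output directory to the local IPFS daemon. The sketch below shells out to the `ipfs` CLI (`ipfs add -r` pins the content by default); the directory path is a placeholder.

```python
"""Minimal sketch: add the crawled output directory to a local IPFS node
and print the root CID. The directory name is a placeholder."""
import subprocess

result = subprocess.run(
    ["ipfs", "add", "-r", "--quieter", "backup_output/"],
    capture_output=True,
    text=True,
    check=True,
)
root_cid = result.stdout.strip()  # --quieter prints only the final (root) CID
print("root CID:", root_cid)
print(f"browse via a gateway, e.g. https://ipfs.io/ipfs/{root_cid}")
```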

Other

CC BY-NC-ND 2.0
