
On the practice of "content preservation", I have a few things to say

What follows is fairly blunt, but getting things done really does require efficiency.

Manually copying and pasting articles one by one is simply inefficient. If the goal is just to preserve them (set aside curation and the like for now), you should look for a more efficient method rather than settling for "at least I did something".

To put it bluntly, for a site with a large number of articles (FB or news sites, say, rather than scattered personal pages), instead of clipping pages by hand, you are better off spending that time on paid work and using the money to hire even a third-rate engineer to use or write a simple tool that saves pages in batches.

What options are there to consider?

Submit URLs to sites like archive.org

  • Method: submit URLs manually, submit them with one click via a browser extension, or submit them in batches with command-line tools such as archivenow (see the sketch after this list)
  • Completeness: almost identical to the original page
  • Advantages: works with almost any website; no need to write code or maintain a server
  • Disadvantages: submissions are rate-limited, and both stability and accessibility depend on a single site (which is easy to block)
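As a rough illustration of the batch-submission idea, here is a minimal Python sketch that pushes a list of URLs to the Wayback Machine's Save Page Now endpoint (https://web.archive.org/save/<url>), the kind of endpoint that tools like archivenow wrap. The URL list and the 10-second pause between submissions are assumptions for illustration, not official figures.

```python
"""Minimal sketch: batch-submit URLs to the Wayback Machine's
Save Page Now endpoint. URLs and delay are placeholders."""
import time

import requests


def submit_to_wayback(url: str) -> bool:
    """Ask the Wayback Machine to capture a single URL."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    return resp.ok


urls = [
    "https://example.com/article/1",
    "https://example.com/article/2",
]

for url in urls:
    ok = submit_to_wayback(url)
    print("saved" if ok else "failed", url)
    time.sleep(10)  # assumed pause to stay under the submission rate limit
```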

Saving WARC archives yourself

  • Method: use tools such as archivenow to generate the archives (a sketch using the warcio library follows this list)
  • Completeness: almost identical to the original page
  • Advantages: works with almost any website; no need to write code; no reliance on a single site; the files can be stored in a decentralized way via IPFS
  • Disadvantages: large files, slow capture speed, and a service such as IPWB has to be set up before ordinary users can browse the archives
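For the "generate your own WARC" route, here is a minimal sketch using the warcio library's capture_http helper (a different tool from archivenow, chosen only to keep the example short). The output filename and URL list are placeholders.

```python
"""Minimal sketch: record fetched pages into a local WARC file
using warcio's capture_http context manager."""
from warcio.capture_http import capture_http

import requests  # imported after capture_http so its traffic gets recorded

urls = [
    "https://example.com/article/1",
    "https://example.com/article/2",
]

# Every request made inside this block is written to backup.warc.gz,
# including the full HTTP request/response records.
with capture_http("backup.warc.gz"):
    for url in urls:
        requests.get(url, timeout=30)
```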

Custom crawler that generates HTML

  • Method: write your own crawler and decide what content to capture (plain text only, text plus images, ...); see the extraction sketch after this list
  • Completeness: depends on how the crawler is implemented
  • Advantages: faster crawling, smaller files, the files can be stored in a decentralized way via IPFS, and ordinary users can open them directly
  • Disadvantages: lower page fidelity, plus the cost of writing the crawler (which is hard to reuse across websites)
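The "decide what to capture" step boils down to parsing each page and keeping only the parts you care about. Below is a minimal sketch that keeps just the title and body text; the CSS selectors are hypothetical and a real crawler has to match the target site's actual markup.

```python
"""Minimal sketch: fetch one page and extract title + plain body text.
Selectors are assumptions, not the markup of any real site."""
import requests
from bs4 import BeautifulSoup


def extract_article(url: str) -> dict:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.select_one("h1")                   # assumed: title lives in <h1>
    body = soup.select_one("div.article-content")   # assumed body container class
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "text": body.get_text("\n", strip=True) if body else "",
    }


print(extract_article("https://example.com/article/1"))
```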

The plan I chose for this backup

Taking this backup of Hong Kong Apple Daily articles as an example, the first option was ruled out right away: the total number of articles is too large (over one million), and I did not want to rely too heavily on organizations/sites such as archive.org.

In terms of user experience, WARC would be the ideal solution. However, the capture speed is low and the files take up a lot of space (roughly 5 MB for a single page, more for image-heavy articles), so it was also skipped.

The final solution was to write my own plain-text crawler. The fidelity is low, but the speed is quite fast (about 12 pages per second with 16 threads), the space requirement is modest (basically under 3 MB per day of articles), and ordinary users can easily download and browse the result.
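As a rough sketch of how such a threaded fetch loop can look, the example below runs 16 worker threads (matching the figure above) and writes the results for one day into a JSON file. The fetch_text function, the URL list, and the output filename are placeholders standing in for the site-specific parts.

```python
"""Minimal sketch: fetch many pages with 16 threads and dump the
results to one JSON file per day. All names below are placeholders."""
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch_text(url: str) -> dict:
    # Placeholder: a real crawler would parse out title/body text here.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return {"url": url, "html_length": len(resp.text)}


urls = [f"https://example.com/article/{i}" for i in range(1, 101)]  # placeholder list

results = []
with ThreadPoolExecutor(max_workers=16) as pool:  # 16 threads, as in the post
    futures = {pool.submit(fetch_text, u): u for u in urls}
    for fut in as_completed(futures):
        try:
            results.append(fut.result())
        except Exception as exc:  # keep going even if a single page fails
            print("failed:", futures[fut], exc)

with open("2021-06-16.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)
```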

How long did it take in total? (2016-01-01 to 2021-06-16; the remaining years are still running)

Note: crawlers are not my specialty and I am not familiar with IPFS, so my efficiency here was not high.
  • Writing the program (first draft + fine-tuning + debugging): no more than 3 hours
  • Running the crawler: about 1 day (16 threads were not used from the start, and this includes rerun time after mid-course adjustments)
  • Setting up an IPFS node: 10 minutes
  • Adding the files to the IPFS node and pinning them: a few hours
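The add-and-pin step can be as simple as handing the crawler's output directory to the local IPFS daemon. The sketch below shells out to the `ipfs` CLI (`ipfs add -r` pins the content by default); the directory path is a placeholder.

```python
"""Minimal sketch: add the crawled output directory to a local IPFS node
and print the root CID. The directory name is a placeholder."""
import subprocess

result = subprocess.run(
    ["ipfs", "add", "-r", "--quieter", "backup_output/"],
    capture_output=True,
    text=True,
    check=True,
)
root_cid = result.stdout.strip()  # --quieter prints only the final (root) CID
print("root CID:", root_cid)
print(f"browse via a gateway, e.g. https://ipfs.io/ipfs/{root_cid}")
```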

Other

CC BY-NC-ND 2.0
