Posts Tagged ‘website archive’

A very powerful feature of Cloud Preservation is its ability to collect external links. External links are links to web pages or documents that are outside of the website or social media feed being collected.

For website feeds, Cloud Preservation determines whether a link is external by comparing the address of the link to the addresses defined in your feed. For social media feeds (such as Twitter or Facebook), an external link is any link found in a post from the feed.
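As a rough illustration of the website-feed comparison, here is a minimal Python sketch. It assumes that "comparing the address" means comparing host names; the exact rule Cloud Preservation applies isn't described here, and the function name `is_external` is hypothetical.

```python
from urllib.parse import urlsplit


def is_external(link: str, feed_addresses: list[str]) -> bool:
    """Return True when the link's host matches none of the feed's addresses.

    Assumption: addresses are compared by host name only. The real crawler
    may also take paths, ports, or subdomains into account.
    """
    link_host = urlsplit(link).hostname or ""
    feed_hosts = {urlsplit(address).hostname or "" for address in feed_addresses}
    return link_host not in feed_hosts


feed = ["http://www.example.com/"]
print(is_external("http://www.example.com/about", feed))
print(is_external("http://news.example.org/story", feed))
```

With a feed defined as `http://www.example.com/`, the first link is treated as internal and the second as external.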

Cloud Preservation provides four configurable options for managing external links. These options allow you to tailor your feeds to meet your collection needs and give you a level of control over your feed’s storage use.

Option 1: Never collect external links

This option allows you to ignore offsite links entirely. When the Cloud Preservation crawler encounters a link that it determines to be external, it will record that link, but will not collect the web page at that link’s address. Since this option leaves these external pages out of your repository completely, these external links have no impact on your feed’s storage use.

When to use this option: There is no requirement to collect external pages, and/or there isn’t enough storage capacity for external pages in the Cloud Preservation repository for the selected plan.

Option 2: Never collect modified versions of external links

With this option selected, Cloud Preservation will look to see if it has ever collected this external link before, by comparing the address to all of the addresses of pages it has collected in the past. If it finds another page in the repository that bears this same address, then Cloud Preservation will simply link the existing page to the currently running crawl. Of all the options to collect external links, this has the lowest impact on storage for the repository.

When to use this option: There is a requirement to collect external pages, but the latest version isn’t important or of consequence. For social media feeds like Twitter, modifications to external pages often aren’t relevant. For example, the external link could be an article or blog post with constantly changing advertisements and user comments that don’t matter for your collection.

Option 3: Collect modified versions of external links for new or modified pages

If Cloud Preservation crawls an internal page that has not changed since the last collection, then it will not attempt to fetch the latest version of any external links. However, if the page has changed since the last collection, or is a page that has not been collected previously, then Cloud Preservation will check for new versions of all external links on that page. This option is slightly less efficient in terms of repository storage, but does offer savings over the final option.

Note: This is the default setting for new Cloud Preservation feeds, as we’ve found it to be the best choice for enhancing your collection with external links while keeping storage use in check.

When to use this option: There is a requirement to collect a “point in time” snapshot of both the internal pages and the external pages.

Option 4: Always collect the latest external link

Finally, this option will always attempt to fetch the latest version of the external link. If the link is found on a new internal page, modified internal page, or unmodified internal page, Cloud Preservation will crawl the external link to see if there is a new version. This option will have the largest impact on storage, as external pages frequently change due to rotating advertisements or images and changed content.

When to use this option: The latest version of offsite pages must always be collected, and the chosen Cloud Preservation plan has surplus storage capacity. This option is also necessary for some advanced crawling techniques, such as using a single internal web page whose purpose is to provide an index of several external links.
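To compare the four options side by side, the decision each one makes can be sketched as a small Python function. This is only a reading of the descriptions above, not Cloud Preservation's actual code: the names `ExternalLinkPolicy` and `should_fetch` are hypothetical, and the sketch assumes that Option 2 fetches an external page the first time its address is seen.

```python
from enum import Enum


class ExternalLinkPolicy(Enum):
    NEVER_COLLECT = 1                  # Option 1
    NEVER_COLLECT_MODIFIED = 2         # Option 2
    COLLECT_FOR_NEW_OR_MODIFIED = 3    # Option 3 (the default for new feeds)
    ALWAYS_COLLECT_LATEST = 4          # Option 4


def should_fetch(policy: ExternalLinkPolicy,
                 already_in_repository: bool,
                 internal_page_new_or_modified: bool) -> bool:
    """Decide whether the crawler fetches an external link during this crawl."""
    if policy is ExternalLinkPolicy.NEVER_COLLECT:
        # The link is recorded, but the page behind it is never collected.
        return False
    if policy is ExternalLinkPolicy.NEVER_COLLECT_MODIFIED:
        # Assumption: fetch once, then reuse the stored copy on later crawls.
        return not already_in_repository
    if policy is ExternalLinkPolicy.COLLECT_FOR_NEW_OR_MODIFIED:
        # Only re-check external links when the page containing them changed.
        return internal_page_new_or_modified
    # ALWAYS_COLLECT_LATEST: check for a new version on every crawl.
    return True
```

Read this way, the storage impact follows from how often each branch returns `True`: never for Option 1, once per address for Option 2, only alongside internal changes for Option 3, and on every crawl for Option 4.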

The crawling process of Cloud Preservation can get complicated, just like the web, and we hope this sheds a bit of light on the subject of external links.


CloudPreservation.com includes a very powerful search capability, so that you can gather quite a bit of information about your archived websites and social media.

In this post, we’ll walk you through some tips and tricks for using dates and crawl times to isolate documents that appeared in a specific timeframe.

What you should know about document dates in CloudPreservation.com

CloudPreservation requests information from Twitter, LinkedIn, and Facebook via their respective APIs. Because of the structured and predictable nature of these APIs, CloudPreservation.com is able to store these dates in its database, as well as its search index.

Since web pages don’t provide a posted date in a predictable manner, CloudPreservation cannot determine the date on which a page was posted. Therefore, CloudPreservation does not have any data in its database or index for the document date of a web page.

However, CloudPreservation does crawl web sites at specified intervals, so you can use these intervals to determine when a page was added, changed or deleted. The accuracy of this method depends on how frequently your feed is configured to be crawled.

What this means is that with CloudPreservation you can search by document date for Twitter, LinkedIn or Facebook posts, and you can search web pages by using crawl frequency ranges.
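To make the accuracy point concrete, here is a small Python sketch of the reasoning: if a page first shows up in a given crawl, it must have been added some time after the previous crawl and no later than that one. The function name `added_window` and the sample crawl dates are illustrative only and are not part of CloudPreservation.

```python
from datetime import date
from typing import Optional


def added_window(crawl_dates: list[date],
                 first_seen: date) -> tuple[Optional[date], date]:
    """Return (previous crawl date, first-seen crawl date) for a page.

    The page was added somewhere inside this window. The first element is
    None when the page was already present in the earliest crawl.
    """
    ordered = sorted(crawl_dates)
    index = ordered.index(first_seen)  # raises ValueError if not a crawl date
    previous = ordered[index - 1] if index > 0 else None
    return previous, first_seen


crawls = [date(2011, 4, 27), date(2011, 5, 28), date(2011, 6, 28)]
print(added_window(crawls, date(2011, 5, 28)))
```

With monthly crawls the window is about a month wide; crawling more often narrows it.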

With that, let’s look at some common search scenarios and how you’d execute them using CloudPreservation’s powerful search functionality.

Show me what my social media feed looked like on a certain date

Often you’d like to see what your social media feed looked like on a specific date. In this case, what you’d like to tell CloudPreservation.com is: “Show me all posts on or before this date, exclude offsite links, and order by date in reverse-chronological order.”

Using a combination of the date range search condition and a document type condition, CloudPreservation can deliver you this information. So, if you’d like to see what your social media feed looked like on June 22nd, 2011, you could construct your search like so:

document_date:[1970-01-01 2011-06-22] AND NOT document_type:"Web Page"

Once you have your results, you can order by document date in descending order.
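If you run this search for many different dates, the query text can be generated rather than typed. The helper below is a hypothetical convenience, not part of CloudPreservation; it simply reproduces the query shown above, keeping the same 1970-01-01 lower bound, for any date you pass in.

```python
from datetime import date


def feed_as_of_query(as_of: date) -> str:
    """Build the 'feed as it looked on this date' search string."""
    return (
        f"document_date:[1970-01-01 {as_of.isoformat()}] "
        'AND NOT document_type:"Web Page"'
    )


print(feed_as_of_query(date(2011, 6, 22)))
```

The resulting string can be pasted into the search text box as-is.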

Show me all new pages added in the last crawl

Sometimes you just want to see everything that’s new in your feed since the last time it was crawled. To do that, select a crawl from the crawl list below the search text box. Once a crawl is selected, a checkbox will show that allows you to restrict the search to pages that were created in the selected crawl.

Your results then should reflect any new pages, or pages that have changed since the crawl previous to the one selected. You can optionally enter a search term to narrow the results here as well.

See this blog post on the feature for further information.

Show me pages that are in one crawl but not the other

Sometimes you’d like to see the complement of a crawl, to determine what’s been removed between crawls. In this case, we build the search syntax like so:

crawl:"My Web Site - 2011-04-27 - 2011-05-28" AND NOT crawl:"My Web Site - 2011-05-28 - 2011-06-28"

To find out what exactly to put inside the quotes as the crawl name, you can copy the name of the crawls you are interested in from the crawl list below the search text box.

Note: Minimizing duplicates in your web site crawls enhances this report greatly. You can work with Nextpoint to build a customized SmartCrawl, which can filter out irrelevant changes between documents from crawl to crawl.
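Conceptually, this search is a set difference: everything in the earlier crawl minus everything in the later one. A minimal Python sketch of that idea follows, using made-up page addresses. It assumes pages are matched by their address alone, and `removed_between` is a hypothetical name.

```python
def removed_between(earlier_crawl: list[str], later_crawl: list[str]) -> list[str]:
    """Return the addresses present in the earlier crawl but not the later one."""
    return sorted(set(earlier_crawl) - set(later_crawl))


april_crawl = ["http://www.example.com/", "http://www.example.com/old-promo"]
may_crawl = ["http://www.example.com/", "http://www.example.com/news"]
print(removed_between(april_crawl, may_crawl))
```

Swapping the two arguments gives the opposite report: pages that are in the later crawl but not the earlier one.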

Show me the history of a page

One other common task is looking at the history of a page within CloudPreservation. By looking at the history, you can see what changed, and, depending on the feed’s crawl frequency setting, get a timeframe for when the page was added, updated or deleted.

To get the history of a page, you need to perform a search based on the URL of the page.


This will return all instances of this page that exist in CloudPreservation.com. You can view each of the results and see how the page has changed through time, get an idea of when it arrived on the site, or when it was removed from the site.

Note: Again, minimizing duplicates in your web site crawls enhances this report greatly.

Hopefully you’ll find these tips and tricks helpful when searching your feeds within CloudPreservation.com.
