Archive for November, 2011

CloudPreservation.com includes a very powerful search capability, so you can gather quite a bit of information about your archived websites and social media.

In this post, we’ll walk you through some tips and tricks for using dates and crawl times to isolate documents that appeared in a specific timeframe.

What you should know about document dates in CloudPreservation.com

CloudPreservation requests information from Twitter, LinkedIn, and Facebook via their respective APIs. Because these APIs are structured and predictable, CloudPreservation.com is able to store the post dates in its database, as well as its search index.

Since web pages don’t provide a posting date in a predictable manner, CloudPreservation cannot determine the date on which a page was posted. Therefore, CloudPreservation does not have a document date for web pages in its database or index.

However, CloudPreservation does crawl web sites at specified intervals, so you can use these intervals to determine when a page was added, changed or deleted. The accuracy of this method depends on the crawl interval you have configured: the more frequent the crawls, the narrower the window in which a change can be placed.
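To make the reasoning about crawl intervals concrete, here is a minimal sketch in Python. It is not part of CloudPreservation; the function name `added_window` and the input format (a list of crawl dates with a flag for whether the page was present) are assumptions made up for illustration. It shows that a page's arrival can only be narrowed down to the gap between the last crawl that lacked it and the first crawl that contained it.

```python
from datetime import date

def added_window(crawls):
    """Bound the date a page first appeared (hypothetical helper).

    crawls: list of (crawl_date, page_present) tuples, sorted by date.
    Returns (last_crawl_without_page, first_crawl_with_page), or None
    if the page never appears. The first element is None when the page
    was already present in the earliest crawl.
    """
    previous = None
    for crawl_date, present in crawls:
        if present:
            return (previous, crawl_date)
        previous = crawl_date
    return None

# Illustrative data only: monthly crawls, page first seen in the third one.
crawls = [
    (date(2011, 4, 27), False),
    (date(2011, 5, 28), False),
    (date(2011, 6, 28), True),
]
print(added_window(crawls))
```

With monthly crawls like these, the page can only be said to have arrived sometime between May 28 and June 28; a weekly crawl would shrink that window to about seven days.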

What this means is that with CloudPreservation you can search by document date for Twitter, LinkedIn or Facebook posts, and you can search web pages by using crawl frequency ranges.

With that, let’s look at some common search scenarios and how you’d execute them using CloudPreservation’s powerful search functionality.

Show me what my social media feed looked like on a certain date

Often you’d like to see what your social media feed looked like on a specific date. In this case, what you’d like to tell CloudPreservation.com is: “Show me all posts on or before this date, exclude offsite links, and order by date in reverse-chronological order.”

Using a combination of the date range search condition and a document type condition, CloudPreservation can deliver this information. So, if you’d like to see what your social media feed looked like on June 22nd, 2011, you could construct your search like so:

document_date:[1970-01-01 2011-06-22] AND NOT document_type:"Web Page"

Once you have your results, you can order by document date in descending order.
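If you run this kind of search often, you could generate the query text rather than typing it by hand. The sketch below is only an illustration: `feed_as_of_query` is a hypothetical helper, not a CloudPreservation API, and it simply reproduces the query format shown above for any cutoff date. The 1970-01-01 default start date is taken from the example query.

```python
from datetime import date

def feed_as_of_query(as_of: date, start: date = date(1970, 1, 1)) -> str:
    """Build the 'feed as of a date' search string (hypothetical helper).

    Matches every document dated between start and as_of, excluding
    documents whose type is "Web Page".
    """
    return (
        f"document_date:[{start.isoformat()} {as_of.isoformat()}] "
        'AND NOT document_type:"Web Page"'
    )

# Reproduces the June 22nd, 2011 example from this post.
print(feed_as_of_query(date(2011, 6, 22)))
```

The resulting text can be pasted into the search box; sorting by document date in descending order is still done on the results page.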

Show me all new pages added in the last crawl

Sometimes you just want to see everything that’s new in your feed since the last time it was crawled. To do that, select a crawl from the crawl list below the search text box. Once a crawl is selected, a checkbox appears that allows you to restrict the search to pages that were created in the selected crawl.

Your results should then reflect any pages that are new, or that have changed, since the crawl before the one selected. You can optionally enter a search term to narrow the results here as well.

See this blog post on the feature for further information.

Show me pages that are in one crawl but not the other

Sometimes you’d like to see the complement of a crawl, to determine what’s been removed between crawls. In this case, we build the search syntax like so:

crawl:"My Web Site - 2011-04-27 - 2011-05-28" AND NOT crawl:"My Web Site - 2011-05-28 - 2011-06-28"

To find out what exactly to put inside the quotes as the crawl name, you can copy the name of the crawls you are interested in from the crawl list below the search text box.
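Since the crawl names have to match exactly, it can help to assemble the query from copied names instead of retyping them. The following is a minimal sketch; `removed_between_query` is a hypothetical helper and not something CloudPreservation provides. It rejects names containing a double quote because this post does not say how such a name would be escaped.

```python
def removed_between_query(earlier_crawl: str, later_crawl: str) -> str:
    """Build a search for pages in earlier_crawl but not in later_crawl.

    Both arguments are crawl names copied from the crawl list
    (hypothetical helper; the query format follows the example above).
    """
    for name in (earlier_crawl, later_crawl):
        if '"' in name:
            # Escaping rules are not documented here, so refuse to guess.
            raise ValueError(f"crawl name contains a double quote: {name!r}")
    return f'crawl:"{earlier_crawl}" AND NOT crawl:"{later_crawl}"'

print(removed_between_query(
    "My Web Site - 2011-04-27 - 2011-05-28",
    "My Web Site - 2011-05-28 - 2011-06-28",
))
```

Swapping the two arguments gives the opposite view: pages that are in the later crawl but were not in the earlier one.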

Note: Minimizing duplicates in your web site crawls enhances this report greatly. You can work with Nextpoint to build a customized SmartCrawl, which can filter out irrelevant changes between documents from crawl to crawl.

Show me the history of a page

One other common task is looking at the history of a page within CloudPreservation. By looking at the history, you can see what changed, and, depending on the feed’s crawl frequency setting, get a timeframe for when the page was added, updated or deleted.

To get the history of a page, you need to perform a search based on the URL of the page.


This will return all instances of this page that exist in CloudPreservation.com. You can view each of the results and see how the page has changed through time, get an idea of when it arrived on the site, or when it was removed from the site.

Note: Again, minimizing duplicates in your web site crawls enhances this report greatly.

Hopefully you’ll find these tips and tricks helpful when searching your feeds within CloudPreservation.com.

