Web Archives FAQ
How are websites selected?
Subject specialists and web collection curators at The George Washington University Libraries select websites to be archived. Criteria for selection include relevance of the website to library collection development initiatives, current research and teaching, perceived risk that a website will disappear, and likelihood that a website will not be archived or preserved by other means. Websites affiliated with The George Washington University and organizations whose paper archives are held at GW will be additional priorities for web archiving.
GW campus members and partners may request captures of GW institutional and affiliated websites via the Request Form for Archiving GW-Affiliated Website.
What is your copyright policy for archiving websites?
We follow principles and techniques of non-intrusive harvesting. We respect the intellectual property rights of others. Some websites may contain material that is produced by other parties who may claim copyright ownership of such materials. GW reserves the right to remove any material that in our reasonable opinion may violate copyright or other intellectual property rights. Copyright holders, including third-party copyright holders, who believe their rights have been infringed by inclusion of their content in our archive may contact us at email@example.com.
How do you collect and store websites?
We use the open-source web crawler Heritrix to create archival copies of the websites. We currently manage Heritrix through the Internet Archive's Archive-It service. All data created using the Archive-It service is hosted and stored by the Internet Archive. Our web archive is available at https://archive-it.org/organizations/825. In the future we may also store the data in GW Libraries' digital repository.
Do website owners have to change or alter websites to be included in the crawls?
No, website owners do not have to change the content, structure, or appearance of their websites to be included in the crawls.
Will your crawling interfere with access to our website?
We crawl websites at a rate that will not interfere with access to your website. Crawls of actively updated websites will generally be run quarterly or semi-annually and last a few days. Once a crawl is complete, the crawler no longer interacts with your server. If you encounter any issues or have any additional questions, please contact us at firstname.lastname@example.org.
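For readers curious about what crawling "at a rate that will not interfere" can mean in practice, the sketch below shows one common approach: pausing between successive requests to the same server. It is only an illustration and not the Heritrix or Archive-It implementation. The function name `polite_fetch`, the `fetch` callback, and the five-second default delay are all assumptions made for this example; the actual crawl settings are not stated here.

```python
import time


def polite_fetch(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content.
    `delay_seconds` is a hypothetical value chosen for illustration,
    not the delay used by any real crawl.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first, so the
            # server never receives back-to-back requests.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

The `sleep` parameter exists only so the pause can be replaced with a stand-in when trying the function out; a real crawler would also limit concurrent connections and adjust its delay to the server's response times.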
Are you able to capture media, audio and video files?
Yes, downloadable media, audio, and video files can usually be captured, although YouTube videos are challenging. Our crawler follows links to discover and capture content, so a link to the content must exist on a website for that content to be included in the archive. We cannot capture files that are not linked and must instead be retrieved from a database via a user query (for example, a publications database that requires a search to be executed before publications can be accessed). Streaming audio and video cannot be captured at all by the current generation of web crawlers.
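The following sketch illustrates why linked files can be discovered while content behind a search form cannot. It is a simplified stand-in written with Python's standard library, not the way Heritrix itself is implemented. The class `LinkCollector`, the function `discover_links`, the sample page, and the example.org URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def discover_links(base_url, html):
    """Return every URL that the page explicitly links to or embeds."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page: two linked files and one search form.
SAMPLE_PAGE = """
<a href="/reports/annual.pdf">Annual report</a>
<audio src="media/interview.mp3"></audio>
<form action="/search"><input name="q"></form>
"""

links = discover_links("https://example.org/", SAMPLE_PAGE)
```

In this example `links` contains the PDF and the MP3, because both appear as explicit links in the page. The publications behind the search form never appear, since reaching them would require submitting a query that this simple link-follower has no way to guess.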
How can I view the websites that have been archived? Will access always be free?
Archived websites will remain freely accessible to the public. Websites can be viewed by date of capture via our web archive hosted at the Internet Archive. Staff may explore additional means of viewing archived websites.
Why do the archived versions of some websites appear to be incomplete?
Our crawler can only capture content that it reaches by following links. Files that are not linked and must be retrieved from a database via a user query are not captured, and neither is streaming audio or video. Some embedded media, such as YouTube videos, are also difficult to capture, so pages that rely on this kind of content may appear incomplete in the archive.
I would like my organization’s website to be removed from the GW Libraries web archive. Who do I contact?
We will refrain from harvesting websites whose owners do not wish to participate in this project and we will honor requests to remove archived content. Please contact email@example.com.
How can I learn more about your project?
Contact firstname.lastname@example.org for more information.
Columbia University Libraries https://library.columbia.edu/bts/web_resources_collection/faq.html