Web scraping

Web scraping (also called Web harvesting or Web data extraction) is a software technique for extracting information from websites. Such programs usually simulate human exploration of the Web, either by implementing the low-level Hypertext Transfer Protocol (HTTP) directly or by embedding a full-fledged Web browser such as Internet Explorer (IE) or the Mozilla Web browser. Web scraping is closely related to Web indexing, in which a bot indexes Web content; indexing is the technique adopted by most search engines. Web scraping, in contrast, focuses on transforming unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Typical uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashups and Web data integration.
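The transformation from HTML into structured data described above can be illustrated with a minimal sketch. It uses only the Python standard library and is not taken from any particular scraping tool. The sample page, the `TableScraper` class, and the `scrape_table` and `to_csv` helpers are all hypothetical names introduced for this illustration. A real scraper would first download the page over HTTP, a step omitted here so that the example runs without network access.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would first fetch it over HTTP.
SAMPLE_HTML = """
<html><body>
<table id="prices">
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
</body></html>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def to_csv(records):
    """Serialize the records as CSV text, ready for a spreadsheet."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = scrape_table(SAMPLE_HTML)
print(to_csv(records))
```

The parser treats the first table row as the header and every later row as a record, which is an assumption about the page layout. Real pages often need more specific rules for locating the data of interest.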

Techniques for Web scraping

Web scraping is the process of automatically collecting Web information. It is a field under active development that shares a common goal with the semantic Web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interaction. Web scraping instead favors practical solutions based on existing technologies, even though some of those solutions are entirely ad hoc. As a result, existing Web-scraping technologies provide different levels of automation.

Legal issues

Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. In a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found that systematic crawling, indexing and deep linking of the real estate site Home.dk by the portal site ofir.dk did not conflict with Danish law or the database directive of the European Union.

U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for trespass to chattels, a claim that treats the computer system itself as personal property on which the user of a scraper is trespassing. To succeed on such a claim, however, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system, and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of Web spidering brought before the courts have been considered trespass to chattels.

In Australia, the Spam Act 2003 outlaws some forms of web harvesting.

Technical measures to stop bots

The administrator of a website can use various measures to stop or slow a bot.
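As an illustration only, one measure of this kind is to limit how many requests a single client may make within a time window. The source text does not name this technique; it is chosen here as an assumption. The sketch below uses hypothetical names (`SlidingWindowLimiter`, `allow`, and the example client address) and is not taken from any particular Web server. The clock is passed in as a parameter so that the behavior can be demonstrated without waiting.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                  # injectable for demonstration
        self._hits = defaultdict(deque)      # client id -> recent request times

    def allow(self, client_id):
        """Return True if the request may proceed, False if it should be refused."""
        now = self._clock()
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Demonstration with a hand-controlled clock (assumed values, not real traffic).
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0,
                               clock=lambda: fake_now[0])

burst = [limiter.allow("203.0.113.7") for _ in range(5)]
print(burst)                          # the fourth and fifth requests are refused

fake_now[0] = 10.0                    # the window has passed
print(limiter.allow("203.0.113.7"))   # the client is allowed again
```

A limiter like this slows a bot rather than identifying it. A well-behaved human visitor stays under the limit, while a client issuing rapid automated requests is refused until the window passes.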

Notes

  1. Semantic annotation based web scraping
  2. "FAQ about linking - Are website terms of use binding contracts?". www.chillingeffects.org. 2007-08-20. http://www.chillingeffects.org/linking/faq.cgi#QID596. Retrieved 2007-08-20.
  3. "UDSKRIFT AF SØ- & HANDELSRETTENS DOMBOG". bvhd.dk. 2006-02-24. http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf. Retrieved 2007-05-30.
  4. "Internet Law, Ch. 06: Trespass to Chattels". www.tomwbell.com. 2007-08-20. http://www.tomwbell.com/NetLaw/Ch06.html. Retrieved 2007-08-20.
  5. "What are the "trespass to chattels" claims some companies or website owners have brought?". www.chillingeffects.org. 2007-08-20. http://www.chillingeffects.org/linking/faq.cgi#QID460. Retrieved 2007-08-20.
  6. "Ticketmaster Corp. v. Tickets.com, Inc.". 2007-08-20. http://www.tomwbell.com/NetLaw/Ch07/Ticketmaster.html. Retrieved 2007-08-20.
  7. National Office for the Information Economy (February 2004). "Spam Act 2003: An overview for business" (PDF). Australian Communications Authority. p. 6. http://www.acma.gov.au/webwr/consumer_info/spam/spam_overview_for%20_business.pdf. Retrieved 2009-03-09.
  8. National Office for the Information Economy (February 2004). "Spam Act 2003: A practical guide for business" (PDF). Australian Communications Authority. p. 20. http://www.acma.gov.au/webwr/consumer_info/frequently_asked_questions/spam_business_practical_guide.pdf. Retrieved 2009-03-09.

References

  • Schrenk, Michael (2007). Webbots, Spiders, and Screen Scrapers. No Starch Press. ISBN 978-1-59327-120-6. http://www.nostarch.com/frameset.php?startat=webbots. 

Retrieved from "http://en.wikipedia.org/wiki/Web_scraping"

All text is available under the terms of the GNU Free Documentation License.