Apache Nutch 1.14 Released: Web Crawler

 

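Apache Nutch 1.14 has been released. Nutch is a mature, production-ready Web crawler. Nutch 1.x supports fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.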

Changes:

Bug fixes

  • [NUTCH-2071] – A parser failure on a single document may fail crawling job
  • [NUTCH-2235] – Classpath discrepancy with protocol-selenium in deploy mode
  • [NUTCH-2269] – Clean not working after crawl
  • [NUTCH-2295] – Nutch master docker container broken
  • [NUTCH-2297] – CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
  • [NUTCH-2316] – Library conflict with Parser-Tika Plugin and Lib Folder

Improvements

  • [NUTCH-1763] – Improving comments on the Injector Class
  • [NUTCH-2034] – CrawlDB filtered documents counter.
  • [NUTCH-2035] – Regex filter using case sensitive rules.
  • [NUTCH-2046] – The crawl script should be able to skip an initial injection.
  • [NUTCH-2135] – Ant Eclipse build does not include protocol-interactiveselenium
  • [NUTCH-2193] – Upgrade feed parser plugin to use rome 1.5

For the complete list of changes, see the release notes.

Download:

Source: http://www.oschina.net/news/91887/nutch-1-14