
Apache Nutch 1.13 Released: Web Crawler

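The Apache Nutch Project Management Committee has announced the release of Apache Nutch 1.13. All current users and developers of the 1.x series are advised to upgrade to this version.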

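Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.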

Changes:

Sub-task

  • [NUTCH-2246] – Refactor /seed endpoint for backward compatibility

Bug

  • [NUTCH-1553] – Property ‘indexer.delete.robots.noindex’ not working when using parser-html.
  • [NUTCH-2242] – lastModified not always set
  • [NUTCH-2291] – Fix mrunit dependencies
  • [NUTCH-2337] – urlnormalizer-basic to strip empty port
  • [NUTCH-2345] – FetchItemQueue logs are logged with wrong class name
  • [NUTCH-2349] – urlnormalizer-basic NPE for ill-formed URL “http:/”
  • [NUTCH-2357] – Index metadata throw Exception because writable object cannot be cast to Text
  • [NUTCH-2359] – Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
  • [NUTCH-2364] – http.agent.rotate: IllegalArgumentException / last element of agent names ignored
  • [NUTCH-2366] – Deprecated Job constructor in hostdb/ReadHostDb.java
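
To make the NUTCH-2337 entry above concrete, here is a minimal sketch of what "stripping an empty port" means for a URL. It is not Nutch's code (urlnormalizer-basic is a Java plugin); it is an illustration in Python, and the function name strip_empty_port is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_empty_port(url: str) -> str:
    """Drop a dangling ':' after the host, as in 'http://example.com:/path'.

    Hypothetical helper for illustration only; it does not reproduce the
    full behavior of Nutch's urlnormalizer-basic plugin.
    """
    parts = urlsplit(url)
    # An empty port shows up as a network location that ends with ':'.
    if not parts.netloc.endswith(":"):
        return url  # nothing to strip, return the input untouched
    netloc = parts.netloc[:-1]
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(strip_empty_port("http://example.com:/path"))       # http://example.com/path
    print(strip_empty_port("http://example.com:8080/a?b=1"))  # unchanged
```

URLs without a dangling colon are returned as-is, so the sketch only changes the one case named in the issue title.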

Improvement

  • [NUTCH-1308] – Add main() to ZipParser
  • [NUTCH-2164] – Inconsistent ‘Modified Time’ in crawl db
  • [NUTCH-2234] – Upgrade to elasticsearch 2.3.3
  • [NUTCH-2236] – Upgrade to Hadoop 2.7.2
  • [NUTCH-2262] – Utilize parameterized logging notation across Fetcher
  • [NUTCH-2272] – Index checker server to optionally keep client connection open
  • [NUTCH-2286] – CrawlDbReader -stats to show fetch time and interval
  • [NUTCH-2287] – Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy
  • [NUTCH-2299] – Remove obsolete properties protocol.plugin.check.*
  • [NUTCH-2300] – Fetcher to optionally save robots.txt
  • [NUTCH-2327] – Seeds injected in REST workflow must be ingested into HDFS
  • [NUTCH-2329] – Update Slf4j logging for Java 8 and upgrade miredot plugin version
  • [NUTCH-2336] – SegmentReader to implement Tool
  • [NUTCH-2352] – Log with Generic Class Name at Nutch 1.x
  • [NUTCH-2355] – Protocol plugins to set cookie if Cookie metadata field is present
  • [NUTCH-2367] – Get single record from HostDB

New Feature

  • [NUTCH-2132] – Publisher/Subscriber model for Nutch to emit events

Task

  • [NUTCH-2171] – Upgrade Nutch Trunk to Java 1.8

Download:

http://nutch.apache.org/downloads.html

Reposted from http://www.oschina.net/news/83494/nutch-1-1-3