Text Mining Projects Introduction

Current projects of Leon Lee:

PageRank Algorithm research and development; Phrase Recognizer, BodyText Extractor ===> Wiki Alias Mining (Hadoop version), Html Keywords Extractor

Future projects of Leon Lee:

HtmlClassifier/UrlClassifier, NewsEntityRecognizer, News Aggregator and Classifier

Contents

  • Overview
  • 1. HtmlClassifier / UrlClassifier
  • 2. NewsEntityRecognizer
  • 3. BodyText Extractor
  • 4. Wiki Alias Mining
  • 5. Phrase Recognizer
  • 6. Lucene Ranking Algorithm Incubator.
  • 7. Collaborative Filtering
  • 8. Html Phrase Extractor
  • 9. News Aggregator and Classifier
  • 9. PageRank Algorithms Research & Development

Overview

Goal: Apply knowledge from the specialist fields listed below, together with open-source software, to mine and analyze text in support of the search engine.

Fields: Machine Learning, Natural Language Processing, Information Retrieval, Information Extraction, Distributed Computing, Statistics, etc.

Open-source software used: Weka, LingPipe, Mahout, Lucene, Solr, Hadoop, and others.

Techniques used: classification, clustering, named entity recognition, Latent Dirichlet allocation (LDA), relevance ranking, query expansion, artificial neural networks, collaborative filtering / recommender systems, etc.

Development order: Phrase Recognizer, BodyText Extractor ===> Wiki Alias Mining, HtmlClassifier, NewsEntityRecognizer, Html Keywords Extractor, News Aggregator and Classifier

1. HtmlClassifier / UrlClassifier

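Uses various classifiers to determine the type of a web page or website. Training data comes from public directories such as dmoz.org or from hand-built data sets. The main-content extraction of the BodyText Extractor strips noise from the HTML, and classifiers for different purposes are then trained on n-gram or word-based features plus effective auxiliary features (URL, metadata, title, etc.).

Applications: for example, a news classifier, a general website classifier, or a single-page classifier.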
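The feature scheme described above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the `features` helper, the `NaiveBayes` class, and the choice of multinomial naive Bayes with add-one smoothing are assumptions made for the example, and the project itself may rely on Weka or LingPipe classifiers instead.

```python
import math
import re
from collections import Counter, defaultdict


def features(url, title, body, n=3):
    """Word features from title/body, tokens from the URL, and character n-grams of the body."""
    feats = ["w=" + w for w in re.findall(r"[a-z0-9]+", (title + " " + body).lower())]
    feats += ["u=" + t for t in re.findall(r"[a-z0-9]+", url.lower())]
    text = body.lower()
    feats += ["g=" + text[i:i + n] for i in range(len(text) - n + 1)]
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (hypothetical helper)."""

    def __init__(self):
        self.class_docs = Counter()             # number of training pages per label
        self.feat_counts = defaultdict(Counter)  # feature counts per label
        self.vocab = set()

    def train(self, examples):
        # examples: iterable of (feature_list, label)
        for feats, label in examples:
            self.class_docs[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)

    def predict(self, feats):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for f in feats:
                score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The `body` argument is expected to be the noise-free text produced by the BodyText Extractor; prefixing features with `w=`, `u=` and `g=` keeps word, URL and n-gram evidence separate in the same model.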

Development Information:

Html/Url Classifier
HtmlClassifier (developing):
svn://172.0.1.252/opt/svnroot/search_engine/TextMining/HtmlClassifier
Running Program(shell, training set): 172.0.1.248:/opt/Classifier

2. NewsEntityRecognizer

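Takes news pages crawled via RSS feed subscriptions, removes noise (ads, navigation bars, copyright and address information, comments, etc.), analyzes the main news content, and recognizes named entities or specific items stored in a dictionary, such as product names.

Application: listing news related to the query terms at search time.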
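The dictionary part of this recognizer can be illustrated with a greedy longest-match lookup. The sketch below is an assumption about how such a lookup might work, not the project's implementation; `build_dictionary`, `find_entities`, and the entity type labels are hypothetical names, and statistical named entity recognition is not covered.

```python
import re


def build_dictionary(entries):
    """Map lower-cased token tuples to their entity type."""
    return {tuple(name.lower().split()): etype for name, etype in entries}


def find_entities(text, dictionary):
    """Greedy longest-match lookup of dictionary entries over the token sequence."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(k) for k in dictionary), default=0)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest possible entry first, then shorter ones.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + size])
            if key in dictionary:
                found.append((" ".join(tokens[i:i + size]), dictionary[key]))
                i += size
                break
        else:
            i += 1  # no entry starts at this token
    return found
```

Trying the longest entry first means that a product name such as "Canon EOS 5D" is reported as one product rather than as the brand "Canon" followed by unmatched tokens.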

Development Information:

News Entity Recognizer
NewsEntityRecognizer (developing)
svn://172.0.1.252/opt/svnroot/search_engine/TextMining/NewsEntityRecognizer
Running Program(shell, data): 172.0.1.248:/opt/EntityAnalysis

3. BodyText Extractor

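Analyzes a web page and removes noise (ads, navigation bars, copyright and address information, comments, etc.). It gathers statistics on the page's plain text and HTML tags, computes densities, and builds a training set of pages and text; after training, a neural network or a general classifier decides which text is the body content of the page. The resulting clean plain text is then ready for the next stage of text analysis.

Application: the basis for web page analysis and body-text extraction.

Depended on by: HtmlClassifier, NewsEntityRecognizer, Html Keywords Extractor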
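To make the density idea concrete, here is a minimal sketch that computes a text-to-tag density for each line of a page. It deliberately swaps the trained neural network or classifier for a fixed threshold, so it shows only the feature, not the learning step; `line_densities`, `extract_body`, and the threshold value of 20 are assumptions for illustration.

```python
import re

TAG = re.compile(r"<[^>]+>")
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.S | re.I)


def line_densities(html):
    """Return (text, text_length / tag_count) for each line that contains text."""
    html = SCRIPT_STYLE.sub("", html)  # scripts and styles are never body text
    rows = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))
        text = re.sub(r"\s+", " ", TAG.sub(" ", line)).strip()
        if text:
            rows.append((text, len(text) / max(tags, 1)))
    return rows


def extract_body(html, threshold=20.0):
    """Keep lines whose density exceeds a fixed threshold (a stand-in for the trained classifier)."""
    return "\n".join(text for text, density in line_densities(html) if density > threshold)
```

Navigation bars and footers tend to have many tags and little text, so their density is low, while article paragraphs score high. In a trained setup the density would be one input feature among several rather than the sole decision rule.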

Development Information:

BodyText Extractor
UsefulTextExtractor (foundational component, todo)
svn://172.0.1.252/opt/svnroot/search_engine/TextMining/UsefulTextExtractor
For the English news body text extractor, see the Python code under Running Program in the NewsEntityRecognizer project.
BodyTextExtractor (optimized python code, done)
svn://172.0.1.252/opt/svnroot/search_engine/TextMining/UsefulTextExtractor/branches/BodyTextExtractorPython
or 172.0.1.248:/home/lij/projects/BTE_0.3.tgz

4. Wiki Alias Mining

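Uses the rich resources of the wiki (articles, redirects, categories, links, etc.) and clustering and classification methods such as Latent Dirichlet allocation to obtain a suitably sized set of commonly used aliases for the e-commerce domain.
To handle data at wiki scale, shorten experiment time, and iterate quickly on parameters, this needs to run on Mahout, the machine learning library built on top of Hadoop.

Application: provides base data for the search engine, such as a related-search-terms service and query expansion.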
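As a small illustration of how redirects can yield alias candidates, the sketch below groups redirect titles by their target article. It covers only this candidate-gathering step; the LDA clustering and the Hadoop/Mahout execution are not shown. The function name and the input format of (redirect title, target title) pairs are assumptions, not the layout of the actual data files.

```python
from collections import defaultdict


def alias_candidates(redirects, min_aliases=1):
    """Group redirect titles by their target article.

    redirects: iterable of (redirect_title, target_title) pairs.
    Returns {target_title: sorted list of distinct alias titles}.
    """
    groups = defaultdict(set)
    for source, target in redirects:
        # A redirect that differs only in letter case adds no new alias.
        if source.lower() != target.lower():
            groups[target].add(source)
    return {t: sorted(a) for t, a in groups.items() if len(a) >= min_aliases}
```

Raising `min_aliases` is one simple way to keep only well-attested targets before any clustering is applied.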

  • Wiki aliases mining solution
  • Merits and drawbacks of Wiki aliases mining & corresponding solution

Development Information:

Wiki Aliases Mining
LDA-Cluster (for the workflow document, see http://172.0.1.252/mediawiki/index.php/Wiki_aliases_mining_solution )
svn://172.0.1.252/opt/svnroot/search_engine/TextMining/LDA-Cluster
Data File:  172.0.1.249:/home/lij/download/download.freebase.com/wex
Hadoop Version  (todo)

5. Phrase Recognizer

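Analyzes the phrases in an article and uses effective algorithms to recognize phrases in text, providing a base capability for other applications. Statistical methods can also be used to supplement the data set and identify hot topics.

Application: word processing; hot terms, i.e. the trending topics of different periods.

Depended on by: HtmlClassifier, NewsEntityRecognizer, Wiki Alias Mining, Html Keywords Extractor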
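One common statistical approach to phrase recognition is to score adjacent word pairs by pointwise mutual information (PMI). The source does not say which algorithm the project uses, so the sketch below is just one plausible instance; `score_bigrams` and its `min_count` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def score_bigrams(text, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    words = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_pairs = max(len(words) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_pairs
        p_a, p_b = unigrams[a] / n_words, unigrams[b] / n_words
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Pairs that occur together more often than their individual frequencies predict get a positive score. Comparing scores between time windows would be one way to surface the hot terms mentioned above.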

Development Information:

For named-entity and dictionary-based lookup, see the code in the NewsEntityRecognizer project.
Research, Data & code (todo)

6. Lucene Ranking Algorithm Incubator.

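Ranking retrieval results is the core of a search engine. To keep improving search quality, set up a Lucene/Solr retrieval environment on a small data set; write and debug experimental ranking algorithms; analyze the various influencing factors (product popularity, quality of crawled information, site reputation, user click-log analysis, links, etc.); and tune the parameter values of the combined score. Once a stage yields solid results, integrate the mature algorithms and parameters into the public-facing production servers.

Application: rank the most relevant information at the top of the search results.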
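The combined score mentioned above can be pictured as a weighted sum of the individual signals. The following sketch is independent of Lucene/Solr and uses invented field names and weights purely for illustration; the real signal definitions and parameter values are what the incubator experiments are meant to determine.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    relevance: float   # text-match score from the retrieval engine, scaled to [0, 1]
    popularity: float  # product popularity, [0, 1]
    quality: float     # quality of the crawled information, [0, 1]
    site_rank: float   # site reputation, [0, 1]
    click_rate: float  # click-through rate from user logs, [0, 1]


# Hypothetical weights; these are the values the experiments would tune.
WEIGHTS = {"relevance": 0.5, "popularity": 0.15, "quality": 0.1,
           "site_rank": 0.1, "click_rate": 0.15}


def combined_score(doc, weights=WEIGHTS):
    """Weighted linear combination of the ranking signals."""
    return sum(w * getattr(doc, name) for name, w in weights.items())


def rank(docs, weights=WEIGHTS):
    """Return documents ordered by descending combined score."""
    return sorted(docs, key=lambda d: combined_score(d, weights), reverse=True)
```

Passing a different `weights` dictionary to `rank` makes it easy to compare parameter settings on the same small data set before anything is moved to production.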

Development Information:

Set up a demo environment on the 248 host (todo)

7. Collaborative Filtering

Application: when a user views an individual product, recommend related products.
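Since this project is still marked todo, the following is only a sketch of one standard technique that fits the stated application: item-to-item collaborative filtering with cosine similarity over the sets of users who viewed each product. The function names and the (user, item) input format are assumptions; a production version would more likely use Mahout.

```python
import math
from collections import defaultdict


def item_similarities(views):
    """Cosine similarity between items, based on which users viewed them.

    views: iterable of (user_id, item_id) pairs.
    """
    users_by_item = defaultdict(set)
    for user, item in views:
        users_by_item[item].add(user)
    sims = defaultdict(dict)
    items = list(users_by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(users_by_item[a] & users_by_item[b])
            if shared:
                sim = shared / math.sqrt(len(users_by_item[a]) * len(users_by_item[b]))
                sims[a][b] = sims[b][a] = sim
    return sims


def related_items(item, sims, top_n=3):
    """Items most similar to the one being viewed, best first."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: kv[1], reverse=True)
    return [other for other, _ in ranked[:top_n]]
```

The pairwise loop is quadratic in the number of items, which is acceptable for a demo but is exactly the part that would need a distributed implementation on real catalog sizes.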

Development Information:

todo

8. Html Phrase Extractor

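Analyzes a web page, removes noise, and extracts representative words or phrases, including new words. This may require named entity recognition, new-word discovery, statistical comparison, text summarization, and similar techniques.

The Crawler Team may need all phrases extracted, rather than only the few tags that represent the article.

Application: given a web page as input, extract its keywords or phrases.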

Development Information:

Workflow:
1. Extract the main text from the HTML; remove navigational elements, templates, and advertisements.
2. Chunk the main text into sentences.
3. POS-tag the sentences.
4. Chunk noun and verb phrases.
5. Filter stop words.

Work items: research techniques for keyword extraction, phrase extraction, content extraction, and boilerplate detection; coding, testing, debugging.
Still needs improvement: research the latest techniques and modify the source code accordingly.
Techniques used: POS tagging, sentence chunking, phrase chunking, boilerplate detection, full-text extraction.
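The later steps of the workflow above can be sketched without external libraries. Note the substitution: instead of POS tagging and noun/verb phrase chunking (steps 3 and 4, which the project appears to do with LingPipe), this example splits each sentence at stop words and keeps the runs in between as candidate phrases. The stop-word list and all function names are illustrative assumptions, and the input is assumed to be main text already extracted in step 1.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "by", "for", "from", "in", "is", "it",
              "of", "on", "or", "that", "the", "this", "to", "was", "were", "with"}


def split_sentences(text):
    """Step 2: naive sentence chunking on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_phrases(sentence):
    """Stand-in for steps 3-5: split at stop words and keep the runs in between."""
    phrases, current = [], []
    for word in re.findall(r"[A-Za-z0-9'-]+", sentence.lower()):
        if word in STOP_WORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


def extract_phrases(main_text, top_n=5):
    """Count candidate phrases over all sentences and return the most frequent ones."""
    counts = Counter()
    for sentence in split_sentences(main_text):
        counts.update(candidate_phrases(sentence))
    return counts.most_common(top_n)
```

Returning every candidate phrase (a large `top_n`) corresponds to the Crawler Team's need for all phrases, while a small `top_n` gives the few representative tags.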
References (a selection of the papers and resources consulted):
C. Kohlschütter. A densitometric analysis of web template content. In WWW '09: Proc. of the 18th Intl. Conf. on World Wide Web, New York, NY, USA, 2009. ACM.
C. Kohlschütter and W. Nejdl. A Densitometric Approach to Web Page Segmentation. In ACM 17th Conf. on Information and Knowledge Management (CIKM 2008), 2008, pages 1173-1182.
C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate Detection using Shallow Text Features. In WSDM 2010: The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
T. Weninger and W. H. Hsu. Text Extraction from the Web via Text-to-Tag Ratio. In Proceedings of the 2008 19th International Conference on Database and Expert Systems Application (DEXA). IEEE Computer Society, Washington, DC, pages 23-28.
J. Pasternack and D. Roth. Extracting article text from the web with maximum subsequence segmentation. In WWW '09: Proceedings of the 18th International Conference on World Wide Web (Madrid, Spain, April 20-24, 2009). ACM, New York, NY, pages 971-980.
J. Gibson, B. Wellner, and S. Lubar. Adaptive web-page content identification. In WIDM '07: Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management, New York, NY, USA, 2007. ACM, pages 105-112.
http://www.l3s.de/~kohlschuetter/boilerplate/
http://alias-i.com/lingpipe/demos/tutorial/posTags/read-me.html
http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html

9. News Aggregator and Classifier

Aggregates (clusters) news stories and classifies them into predefined categories. Aggregation is entirely automatic, using algorithms that carry out contextual analysis and group similar stories together.
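Because this project is still marked todo, the sketch below is only one simple way to group similar stories: single-pass clustering on bag-of-words cosine similarity. The threshold of 0.4 and all names are assumptions, and the classification into predefined categories is not shown.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one story."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_stories(stories, threshold=0.4):
    """Single-pass clustering: add each story to the first cluster whose
    summed term vector is similar enough, otherwise start a new cluster."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [story indices]}
    for index, text in enumerate(stories):
        vector = vectorize(text)
        for cluster in clusters:
            if cosine(vector, cluster["centroid"]) >= threshold:
                cluster["members"].append(index)
                cluster["centroid"].update(vector)
                break
        else:
            clusters.append({"centroid": Counter(vector), "members": [index]})
    return [cluster["members"] for cluster in clusters]
```

Single-pass clustering handles stories as they arrive from the feeds, at the cost of being sensitive to arrival order; raw term counts could be replaced by TF-IDF weights to reduce the influence of common words.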

Development Information:

todo

9. PageRank Algorithms Research & Development

PageRank is query-independent: it produces a global ranking of the importance of all pages in an index of billions of pages.
Plan: research, experiment, develop, and incorporate the result into the search engine.
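For reference, the basic algorithm can be written as a short power iteration. This is a textbook-style sketch on an in-memory link graph, with the usual damping factor of 0.85 assumed; it says nothing about how the project would scale the computation to billions of pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: {page: [pages it links to]}. Pages with no out-links spread
    their rank evenly over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is redistributed uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The ranks always sum to 1 and depend only on the link structure, which is what makes the score query-independent: it can be computed offline and combined with query-time relevance later.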
