Large-scale Data Mining and Machine Learning

Data mining and machine learning is one of our key research areas. We strive to build scalable, robust computer systems that adapt to and learn from large volumes of dynamic real-world data in a lifelong-learning manner. We develop algorithms and systems that build accurate models from real-time, large-scale data streams drawn from heterogeneous information sources.

The ability to collect data from a wide range of sensors and devices, in many formats, and from independent or connected applications has significantly outpaced the ability to process, analyze, store, and understand these datasets. Data can come from many channels, including the Internet, social media and networking sites, and, more generally, the Internet of Things (IoT).

Our research topics include:

  • Scalable, distributed, and parallel algorithms.
  • New programming models for bulk data, beyond Hadoop/MapReduce and data streaming languages.
  • Mining algorithms for data in non-traditional formats (unstructured, semi-structured).
  • A unified model of data, modeling, and reasoning.
  • Mining from heterogeneous sources.
  • System issues related to large datasets: clouds, streaming, and beyond.
  • Cloud data mining for big data and stream data.
  • Interfaces for database systems and analytics.
  • Large-scale and real-time data visualization.
  • Privacy preservation in big data mining.