Data science and big data processing in R: representations and software

Septem Riza, Lala

Supervised by:

Francisco Herrera Triguero Director
José Manuel Benítez Sánchez Director

Defence university: Universidad de Granada

Defence date: 17 July 2015

Committee:

Antonio González Muñoz Chair
Manuel Gómez Olmedo Secretary
Matías Gámez Martínez Committee member
Luciano Sánchez Ramos Committee member
Antonio Peregrín Rubio Committee member

Type: Thesis

Teseo: 388363

Abstract

The main objective of this thesis is the development of high-quality, easy-to-use software modules for representing, creating, and managing system models and for data analysis. R is the platform of choice, since it has become a de facto standard. The packages developed cover techniques based on fuzzy systems, rough sets, and fuzzy rough sets. In addition, a universal representation framework for fuzzy rule-based systems is introduced. Finally, the implementation of random forests and random ferns for tackling Big Data is discussed. In line with these objectives, the research produced the following results:

1. The "frbs" package: an R package implementing the most relevant types of fuzzy rule-based systems, along with a selection of machine-learning algorithms to build them. The package focuses on classification and regression tasks. It also includes a mechanism that allows human experts to construct a model. It is available on CRAN: http://cran.r-project.org/package=frbs and on the project website: http://sci2s.ugr.es/dicits/software/FRBS.

2. The "RoughSets" package: an R package implementing algorithms based on rough set theory and fuzzy rough set theory for knowledge representation and data analysis. It includes tools for managing missing values, discretization, feature selection, and instance selection, for both classification and regression tasks. It is available on CRAN: http://cran.r-project.org/package=RoughSets and on the project website: http://sci2s.ugr.es/dicits/software/RoughSets.

3. frbsPMML: a universal representation framework for fuzzy rule-based systems, based on the Predictive Model Markup Language. Two software libraries for managing the representation are also implemented: an extension of the "frbs" package and the Java package "frbsJpmml".

4. The "SparkFernTreeR" package: an R package implementing random forests and random ferns for Big Data processing. This package is built on top of the Big Data frameworks Apache Hadoop and Apache Spark.