Day:
- Big Data Analytics, Big Data Algorithm, Business Analytics
Session Introduction
Farag Hamed Kuwil
Karabuk University, Turkey
Title: Two New Algorithms, Critical Distance Clustering and Gravity Center Clustering
Biography:
Farag Kuwil is a PhD student at Karabuk University. He has developed two data clustering algorithms: one published in the journal Expert Systems with Applications, and the second currently in the first round of review.
Abstract:
We developed a new algorithm based on the Euclidean distance among data points together with some operations from mathematical statistics, and called it the critical distance clustering (CDC) algorithm (Kuwil, Shaar, Ercan Topcu, & Murtagh, Expert Syst. Appl., 129 (2019) 296–310. https://authors.elsevier.com/a/1YwCc3PiGTBULo). CDC works without the need to specify parameters a priori, handles outliers properly, and provides thorough indicators for clustering validation. Improving on CDC, we are on the verge of building second-generation algorithms able to handle datasets with larger numbers of objects and dimensions.
Our new, as yet unpublished Gravity Center Clustering (GCC) algorithm falls under partition clustering. It is based on a gravity center (GC), a point within each cluster that verifies both connectivity and coherence when determining the affiliation of each point in the dataset; it can therefore deal with data of any shape. The coefficient lambda determines the threshold that identifies the required similarity inside clusters using Euclidean distance. Together, the two coefficients lambda and n give the observer some flexibility to control the results dynamically, where n represents the minimum number of points in each cluster and lambda is used to increase or decrease the number of clusters. (We distinguish parameters from coefficients: in this study we regard parameters that must be set before an algorithm can run as a disadvantage or challenge, but coefficients that can be tuned to obtain better results as an advantage.) Thus, lambda and n are changed from their default values only to address challenges such as outliers or overlapping clusters.
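As a rough illustration of the roles the abstract assigns to the two coefficients (a toy sketch under our own assumptions, not the CDC or GCC algorithm from the paper), consider distance-threshold clustering where lambda bounds the Euclidean link distance and groups smaller than n are treated as outliers:

```python
# Illustrative sketch only -- NOT the authors' CDC/GCC algorithms.
# lambda_ is a Euclidean-distance threshold controlling how many clusters
# emerge; n is the minimum number of points per cluster, with smaller
# groups labeled -1 (outliers).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def threshold_clustering(X, lambda_=1.0, n=3):
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    adjacency = csr_matrix(D <= lambda_)      # link points closer than lambda_
    _, labels = connected_components(adjacency, directed=False)
    sizes = np.bincount(labels)
    return np.where(sizes[labels] >= n, labels, -1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
print(threshold_clustering(X, lambda_=1.0, n=5))
```

Raising lambda_ merges clusters and lowering it splits them, which mirrors the dynamic control over the number of clusters that the abstract describes.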
Li-Hsin Jhan
National Chung Hsing University, Taichung, Taiwan
Title: Investigating the association between the flooding tolerance genes of soybean by pathway analysis and network analysis
Biography:
Li-Hsin Jhan is a Master's student in the Department of Agronomy at National Chung Hsing University (Taiwan). He majors in bioinformatics and biostatistics. Over the past three years he has focused on abiotic stress in soybean, using systems biology methods to explore mechanisms of flooding stress tolerance.
Abstract:
Under extreme climate conditions, crop damage events are increasing, and there is an urgent need to breed stress-tolerant varieties. Flooding stress at different growth stages of soybean can negatively affect seed germination, plant growth, flowering, yield, and quality. These impacts are linked with the plant's ability to adapt to or tolerate flooding stress, which involves complex physiological traits, metabolic pathways, biological processes, molecular components, and morphological adaptations. However, investigating the mechanisms of flooding stress tolerance is time-consuming. In the present study, we used systems biology approaches to identify pathways and network hubs linked to flooding stress tolerance. We previously identified 63 prioritized flooding tolerance genes (FTgenes) of soybean from multiple-dimensional data sources using large-scale data mining and gene prioritization methods. We conducted competitive (hypergeometric test) and self-contained (SUMSTAT) gene-set enrichment analyses against the gene ontology (GO) database, and found 20 significantly enriched pathways by the hypergeometric test and 20 by SUMSTAT. These GO pathways were further compared with seven candidate pathways identified from gene regulatory pathway databases collected from NCBI PubMed. The FTgenes that resist flooding stress in these significantly enriched pathways form a module through a closely linked pathway crosstalk network. The module was associated with ethylene, jasmonic acid, and abscisic acid biosynthesis and with phosphorylation pathways. These systems biology methods may provide novel insight into the FTgenes and flooding stress tolerance.
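For illustration, a minimal sketch of the competitive (hypergeometric) enrichment test named above; the background size and FTgene count come from the abstracts, while the per-term counts are hypothetical:

```python
# Hypergeometric enrichment: is a GO term over-represented among the
# 63 FTgenes, relative to the full prioritized soybean gene background?
from scipy.stats import hypergeom

N = 35970   # background genes (figure from the companion soybean abstract)
K = 400     # background genes annotated to this GO term (hypothetical)
n = 63      # FTgenes tested
k = 8       # FTgenes annotated to this GO term (hypothetical)

# P(X >= k) when drawing n genes without replacement from the background
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")
```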
Fionn Murtagh
University of Huddersfield, UK
Title: Analytical Focus and Contextuality, Exploiting Resolution Scale, Addressing Bias
Biography:
BA Mathematics, BAI Engineering Science, from Trinity College Dublin; MSc, information retrieval; PhD, Doctorat de troisième cycle, under Professor Jean-Paul Benzécri, Mathematical Statistics, at Université Pierre et Marie Curie, Paris 6; HDR degree, "Pattern Recognition in Astronomy", from what is now Université de Strasbourg. In 1984, I was a visiting scientist at the Joint Research Centre, Centro comune di ricerca, Ispra, Italy. For 12 years, Senior Scientist in the European Space Agency, for the Hubble Space Telescope, based at the European Southern Observatory in Garching, Munich, Germany. I have 28 books published, many edited; 175 journal papers; 42 survey and contributed articles in books; 127 papers in conference proceedings and edited volumes. Listed on my web page, www.fmurtagh.info, are publications; membership and fellowship of many organisations; the organisations where I am Chair, Board member, or Council member; medals awarded; and journal editorial work.
Abstract:
Examples are provided of the following. The Correspondence Analysis platform, also termed Geometric Data Analysis, exploits conceptual resolution scale and, with both analytical focus and contextualization, semantically maps qualitative and quantitative data. Big Data analytics faces new challenges and opportunities; key factors are security through aggregation and the ethical accuracy of individual mapping, and process-wise this is multiresolution analysis. For the analytical topology of the data, from hierarchical clustering, the following is developed, with properties noted here, essentially with linear computational time complexity. For text mining, and also for medical and health analytics, the analysis determines a divisive, ternary (i.e. p-adic, where p = 3) hierarchical clustering from the factor space mapping. Hence the topology (the ultrametric topology, here using a ternary hierarchical clustering) is related to the geometry of the data (the Euclidean-metric-endowed factor space, the semantic mapping of the data from Correspondence Analysis). Determined in Data Mining is the differentiation of what is exceptional and quite unique relative to what is common, shared, and predominant. A major analytical theme, started now, is Mental Health, with analytical focus and contextualization, with the objective of interpreting mental capital. Another analytical theme is to be for developing economies.
Smaranya Dey
Walmart Labs, Bangalore, India
Title: Surge-Adjusted Forecasting in Temporal Data Containing Extreme Observations
Biography:
Smaranya Dey is a Data Scientist at Walmart Labs, India. Her research interests are forecasting and natural language processing. Anirban Chatterjee is a Staff Data Scientist at Walmart Labs, India. His research interest lies in time series analysis and modeling with high-dimensional data.
Abstract:
Forecasting in time series data is at the core of various business decision-making activities. One key characteristic of many practical time series of business metrics, such as orders and revenue, is the presence of irregular yet moderately frequent spikes of very high intensity, called extreme observations. Forecasting such spikes accurately is crucial for activities such as workforce planning, financial planning, and inventory planning. Traditional time series forecasting methods such as ARIMA and BSTS are not very accurate in forecasting extreme spikes, and deep learning techniques such as variants of LSTM tend to perform only marginally better. Their underlying assumption of a thin-tailed data distribution is one of the primary reasons these models falter on extreme spikes, since moderately frequent extreme spikes produce a heavy-tailed distribution. On the other hand, the literature proposing methods to forecast extreme events in time series has focused mostly on the extreme events themselves while ignoring overall forecasting accuracy. We address both problems with a technique that treats a time series with extreme spikes as the superposition of two independent signals: (1) a stationary time series without extreme spikes, and (2) a shock signal that is near zero most of the time, with a few spikes of high intensity. We model the two signals independently and combine their forecasts for the original series. Experimental results show that the proposed technique outperforms existing techniques in forecasting both normal and extreme events.
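A minimal sketch of this superposition idea, under our own simplifying assumptions (a fixed quantile threshold separates the spikes, and naive stand-in models replace whatever the authors actually used):

```python
# Decompose a spiky series into a base signal plus a sparse shock signal,
# then model each independently (here with trivial stand-in forecasts).
import numpy as np

def decompose(y, q=0.95):
    """Split a series into a capped base signal and a sparse shock signal."""
    threshold = np.quantile(y, q)
    shock = np.where(y > threshold, y - threshold, 0.0)  # spikes only
    base = y - shock                                     # capped series
    return base, shock

rng = np.random.default_rng(1)
y = 10 + np.sin(np.arange(200) / 7) + rng.normal(0, 0.5, 200)
y[rng.choice(200, 8, replace=False)] += 15               # inject extreme spikes

base, shock = decompose(y)
base_fc = base[-20:].mean()                              # moving-average stand-in
shock_fc = (shock > 0).mean() * shock[shock > 0].mean()  # spike rate x spike size
print(f"combined one-step forecast: {base_fc + shock_fc:.2f}")
```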
Danchen Wang
Peking Union Medical College Hospital, China
Title: E-BABE- Data mining: seasonal and temperature fluctuations in thyroid-stimulating hormone
Biography:
Ling Qiu completed her master's degree at Peking Union Medical College Hospital. She is the supervisor of Danchen Wang. Dr. Qiu has published more than 40 papers in reputed journals.
Abstract:
Background: Thyroid-stimulating hormone (TSH) plays a key role in maintaining normal thyroid function. Here, we used “big data” to analyze the effects of seasonality and temperature on TSH concentrations to understand factors affecting the reference interval.
Methods: Information from 339,985 patients at Peking Union Medical College Hospital was collected from September 1st, 2013, to August 31st, 2016, and retrospectively analyzed. A statistical method was used to exclude outliers, with data from 206,486 patients included in the final analysis. The research period was divided into four seasons according to the National Weather Service. Correlations between TSH concentrations and season and temperature were determined.
Results: Median TSH levels during spring, summer, autumn, and winter were 1.88, 1.86, 1.87, and 1.96 μIU/L, respectively. TSH fluctuation was larger in winter (0.128) than in summer (0.125). After normalizing the data from each year to the lowest TSH median value (summer), TSH appeared to peak in winter and trough in summer, showing a negative correlation with temperature. Pearson correlation analysis indicated that the monthly median TSH values were negatively correlated with temperature (r = −0.663, p < 0.001).
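For illustration, the reported correlation step might be computed as follows; the monthly values here are hypothetical placeholders, not the study's data:

```python
# Pearson correlation of monthly median TSH against mean monthly temperature.
from scipy.stats import pearsonr

temp_c     = [-3, 0, 7, 15, 21, 26, 28, 27, 21, 13, 4, -1]          # hypothetical
tsh_median = [1.97, 1.95, 1.91, 1.88, 1.87, 1.86,
              1.85, 1.86, 1.87, 1.90, 1.93, 1.96]                   # hypothetical

r, p = pearsonr(temp_c, tsh_median)
print(f"r = {r:.3f}, p = {p:.3g}")   # the study reports r = -0.663, p < 0.001 on its real data
```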
Conclusions: This study showed significant seasonal- and temperature-dependent variation in TSH concentrations. Thus, these might be important factors to consider when diagnosing thyroid function disorders.
Chung-Feng Kao
National Chung Hsing University, Taichung, Taiwan
Title: E-BABE- A comprehensive framework of gene prioritization for flooding tolerance in soybean
Biography:
Chung-Feng Kao completed his PhD at the age of 36 at Lancaster University (UK) and postdoctoral studies at National Taiwan University (Taiwan). He is an assistant professor at National Chung Hsing University. He has published more than 30 papers in reputed journals and serves as an editorial board member of Frontiers.
Abstract:
Soybean [Glycine max (L.) Merr] is rich in protein and oil and is one of the most important crops in the world. Drastic and extreme changes in the global climate have led to decreasing crop production, deteriorating quality, and increasing plant diseases and insect pests, resulting in economic losses. Facing such harsh circumstances, seed that is less susceptible to both abiotic and biotic stresses is urgently needed. The present study proposes a comprehensive framework, including phenotype-genotype data mining, integration analysis, gene prioritization, and systems biology, to construct prioritized flooding tolerance genes (FTgenes) in soybean and to develop a fast, precise breeding platform for variety selection of important traits. We applied big data analytic strategies to mine flooding-tolerance-related phenomic and genomic data in soybean via cloud-based text mining across data sources in the NCBI. We conducted meta-analysis and gene mapping to integrate the information collected from multiple-dimensional data sources, and we developed a prioritization algorithm to precisely prioritize a collection of candidate flooding tolerance genes. As a result, 219 FTgenes were selected, based on the optimal cutoff-point of the combined score, from 35,970 prioritized soybean genes. We found the FTgenes to be significantly enriched in pathways for response to wounding, chitin, water deprivation, and abscisic acid, and for the ethylene and jasmonic acid biosynthetic processes, which play important roles in plant hormone biosynthesis in soybean. Our results provide valuable information for further studies in breeding commercial varieties.
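As an illustration of the prioritization step, a hedged sketch follows; the source names, weights, and cutoff rule are our own assumptions, not the authors' actual algorithm:

```python
# Combine per-source evidence scores into one ranking and keep the genes
# above a cutoff on the combined score (the FTgene selection pattern).
import numpy as np

rng = np.random.default_rng(2)
n_genes = 1000
scores = {                         # per-source evidence in [0, 1], simulated
    "text_mining":   rng.random(n_genes),
    "meta_analysis": rng.random(n_genes),
    "gene_mapping":  rng.random(n_genes),
}
weights = {"text_mining": 0.4, "meta_analysis": 0.4, "gene_mapping": 0.2}

combined = sum(w * scores[s] for s, w in weights.items())
cutoff = np.quantile(combined, 0.99)       # stand-in for the optimal cutoff-point
ft_genes = np.flatnonzero(combined >= cutoff)
print(f"{ft_genes.size} genes selected")
```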
Abstract:
This research work is concerned with the study of General Efficiency and Optimization. We present Efficiency and Optimization in their most natural context, offered by infinite-dimensional ordered vector spaces, following our recent results on these subjects. Implications and applications in vector optimization through the agency of Isac's cones, and the new link between General Efficiency and Strong Optimization via full nuclear cones, are presented. An important extension of our coincidence result between efficient point sets and Choquet boundaries is developed. In this way, Efficiency is connected with Potential Theory through Optimization, and conversely. Several pertinent references conclude this investigation.
Wen Yi
Chinese Academy of Sciences, China
Title: A Big Data Knowledge Computing Platform for Intelligence Studies
Biography:
Wen Yi, professor at the Chengdu Library and Information Center, Chinese Academy of Sciences, holds a Master's degree in Information Science from Sichuan University. He specializes in big data analysis and knowledge discovery information systems and has published more than 30 papers in these fields. He is the head of the project "The Construction of the Intellectual Property Network of CAS" and several other projects. His research has gained the "Sichuan Province Science and Technology Progress Third Award".
Abstract:
Intelligence studies uses modern information technology and soft science research methods to form valuable information products by collecting, selecting, evaluating, and synthesizing information resources. With the advent of the era of big data, the core work of information analysis faces enormous opportunities and challenges. How to make good use of big data, how to optimize and improve traditional intelligence studies methods and tools, and how to innovate research based on big data are the key issues that current intelligence studies work needs to address.
Through an analysis of intelligence studies methods and common tools against the background of big data, we sorted out the processes and requirements of intelligence studies work in a big data environment, and we designed and implemented a universal knowledge computing platform for intelligence studies that enables analysts to easily use all kinds of big data analysis algorithms without writing programs (http://www.zhiyun.ac.cn). Our platform is built upon the open-source big data systems Hadoop and Spark. All data are stored in the distributed file system HDFS and the data management system Hive, all computational resources are managed with Yarn, and each submitted task is scheduled with the workflow scheduler Oozie. The core of the platform consists of three modules: data management, data calculation, and data visualization.
The data management module stores and manages the data relevant to intelligence studies and consists of four parts: metadata management, data connection, data integration, and data management. The platform supports the import and management of multi-source heterogeneous data, including papers and patents from ISI, PubMed, etc., and also supports data import through the APIs of MySQL, Hive, and other database systems. The platform provides more than 20 kinds of data cleaning and updating rules, such as search and replace, regular-expression cleaning, and null filling, and also allows users to customize and edit the cleaning rules.
The data calculation module stores and manages the big data analysis algorithms and intelligence analysis processes, and provides a user-friendly GUI for users to create customized analysis processes; a packaged process can be submitted to the platform for calculation to obtain the results of each step. In the system, a task is formulated as a directed acyclic graph (DAG) in which the source data flows into the root nodes. Each node operates on the data, generates new data, and sends the generated data to its descendant nodes for further operations. Finally, the results flow out from the leaf nodes.
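A minimal sketch of this DAG task model (in Python, with illustrative node names and operations rather than the platform's actual operators):

```python
# Execute a DAG of operations: data enters at the root, each node transforms
# its parent's output, and the final result emerges at the leaf node.
from graphlib import TopologicalSorter

ops = {
    "load":      lambda _:  ["big data", "data mining", "big data"],
    "count":     lambda xs: {t: xs.count(t) for t in set(xs)},
    "top_terms": lambda c:  sorted(c, key=c.get, reverse=True),
}
edges = {"count": {"load"}, "top_terms": {"count"}}   # node -> its parents

results = {}
for node in TopologicalSorter(edges).static_order():  # roots first
    parents = edges.get(node, set())
    inputs = results[next(iter(parents))] if parents else None
    results[node] = ops[node](inputs)
print(results["top_terms"])                           # leaf-node output
```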
The data visualization module visualizes the results of intelligence analysis and calculation, offering more than ten kinds of charts, such as line, histogram, radar, and word cloud charts.
Practice has shown that the platform meets the requirements of intelligence studies in various fields in the era of big data and promotes the application of data mining and knowledge discovery in intelligence studies.
Masoud Barati
School of Computer Science and Informatics, Cardiff University, Cardiff, UK
Title: Using Blockchain for Verifying GDPR Rules in Cloud Ecosystems
Biography:
Masoud Barati is currently a postdoctoral research associate in the School of Computer Science and Informatics at Cardiff University, where he started in November 2018. He is involved in the Privacy-Aware Cloud Ecosystems (PACE) project, which uses GDPR and blockchain technology to enhance user privacy in cloud computing. He received his PhD in computer science from the Université de Sherbrooke in Canada in May 2018; his thesis was on the orchestration of dynamic software components using the behavior composition framework. He has published more than 20 manuscripts in well-known conferences and journals, and is a reviewer for the ICIW conferences and the IEEE Transactions on Services Computing journal. He was also a faculty member in the Department of Computer Engineering of Azad University in Iran from September 2011 to December 2014. His research interests are service composition, distributed systems, blockchain, formal methods, verification, and ontology.
Abstract:
Understanding how cloud providers support the European General Data Protection Regulation (GDPR) remains a main challenge for new providers emerging on the market. GDPR influences access to, storage of, processing of, and transmission of data, requiring these operations to be exposed to the user to seek explicit consent. A privacy-aware cloud architecture is proposed that improves transparency and enables an audit trail of the providers who accessed the user's data to be recorded. The architecture not only supports GDPR compliance by imposing several data protection requirements on cloud providers, but also benefits from a blockchain network that securely stores the providers' operations on the user data. A blockchain-based tracking approach built on a shared privacy agreement implemented as a smart contract is described: providers who violate GDPR rules are automatically reported.
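For illustration only, a toy sketch of the tracking idea in plain Python, standing in for an on-chain smart contract; the class and method names are assumptions, not the paper's implementation:

```python
# A shared privacy agreement records providers' operations in an
# append-only, hash-chained log; operations outside the user's consent
# are flagged as GDPR violations.
import hashlib, json, time

class PrivacyAgreement:
    def __init__(self, user, consented_ops):
        self.user = user
        self.consented_ops = set(consented_ops)   # e.g. {"access", "store"}
        self.chain = []                           # append-only audit trail

    def log_operation(self, provider, op):
        record = {"provider": provider, "op": op, "ts": time.time(),
                  "violation": op not in self.consented_ops,
                  "prev": self.chain[-1]["hash"] if self.chain else None}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.chain.append(record)
        if record["violation"]:
            print(f"VIOLATION: {provider} performed '{op}' without consent")

agreement = PrivacyAgreement("alice", consented_ops={"access", "store"})
agreement.log_operation("cloud-provider-A", "access")    # compliant
agreement.log_operation("cloud-provider-B", "transmit")  # reported
```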
Ma Chao
Peking Union Medical College Hospital, China
Title: Establishing thresholds and effects of gender, age, and season for thyroglobulin and thyroid peroxidase antibodies by mining real-world big data
Biography:
Ma Chao is affiliated with the Department of Clinical Laboratory, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Sciences, Beijing 100730, P.R. China.
Abstract:
Background: Thyroglobulin antibody (TG-Ab) and thyroid peroxidase antibody (TPO-Ab) are cornerstone biomarkers for autoimmune thyroid diseases, and the establishment of appropriate thresholds is crucial for physicians to correctly interpret test results. Therefore, we established thresholds for TG-Ab and TPO-Ab in the Chinese population through analysis of real-world big data, and explored the influence of age, gender, and seasonal factors on their levels.
Methods: The data of 35,869 subjects downloaded from electronic health records were analyzed after filtering based on exclusion criteria and outliers. The influence of each factor on antibody levels was analyzed by stratification. Thresholds of TG-Ab and TPO-Ab were established following Clinical Laboratory Standards Institute document C28-A3 and National Academy of Clinical Biochemistry (NACB) guidelines, respectively.
Results: There were significant differences according to gender after age stratification; the level of TG-Ab gradually increased with age in females. There were significant differences in TG-Ab and TPO-Ab distributions with respect to age after gender stratification. Moreover, differences were observed between seasons for TG-Ab and TPO-Ab. The thresholds of TG-Ab and TPO-Ab were 107 [90% confidence interval (CI): 101–115] IU/mL and 29 (90% CI: 28–30) IU/mL, respectively, using C28-A3 guidelines, but were 84 (90% CI: 50–126) IU/mL and 29 (90% CI: 27–34) IU/mL, respectively, using NACB guidelines.
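A hedged sketch of a C28-A3-style nonparametric threshold (taking the upper reference limit as the 97.5th percentile with a 90% bootstrap CI; the percentile choice and the simulated antibody values are our assumptions, not the study's data):

```python
# Nonparametric upper reference limit with a 90% bootstrap confidence interval.
import numpy as np

rng = np.random.default_rng(3)
tg_ab = rng.lognormal(mean=2.5, sigma=1.0, size=5000)   # simulated IU/mL values

upper_limit = np.percentile(tg_ab, 97.5)
boot = [np.percentile(rng.choice(tg_ab, tg_ab.size), 97.5) for _ in range(1000)]
ci_low, ci_high = np.percentile(boot, [5, 95])
print(f"threshold {upper_limit:.0f} IU/mL (90% CI {ci_low:.0f}-{ci_high:.0f})")
```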
Conclusion: The levels of TG-Ab and TPO-Ab were significantly affected by gender, age, and season. Thresholds for TG-Ab and TPO-Ab for the Chinese population were established by big data analysis.
Biography:
I am a data scientist and digital transformation enthusiast, working as a data science manager at Aurubis AG, Europe's biggest multimetal producer and number three in the world. I am an empirical econometrician with a PhD in econometrics (German Dr. rer. pol.); before moving into industry, I was an empirical researcher for 10 years, focusing on forecasting and uncertainty. My expertise lies not only in theoretical concepts but, more importantly, in their wide application to practical business cases. As an operational data science manager, I have experience both in managing data-driven projects and in their actual realization, from the assessment of a business case and user story to deployment. My project portfolio is very broad, from KPI and other reporting assessment, development, and deployment projects to the development and implementation of deep learning models and AI for relevant business cases.
Abstract:
About three years ago, my boss decided that it was time to leverage the superpowers of data. So I became the first data scientist, a unicorn, among 6,600 colleagues at Aurubis. The primary task was to introduce, explain, promote, and establish the data science skillset within the organization. Old industries, like metallurgy and mining, are not the typical examples of successful digital transformation, because the related business models are extremely stable, even in the era of hyper-innovation. At least, this is what some people believe, and it is partly true: for some branches there is no burning platform for digitization, and hence the change process is inert. Data science is a fundamental component of digital transformation. Our contribution to the change has a huge impact, because we can extract value from the data and generate business value, showing people what can be done when the data is there and valid.
I learned that the most valuable, essential skills for succeeding in our business are not necessarily programming and statistics; we all have training on data science methods at their best. The two must-have skills are resilience and communication. Whenever you start something new, you will fail, and you must stay resilient to rise strongly. Moreover, in the business world the ability to communicate, to tell data-based stories, to visualize them, and to promote them, is crucial. As a data scientist you can only be as good as your communication skills, since you need to persuade others to make decisions or help build products based on your analyses. Finally, dare to start simple. When you introduce data science in industry, you start on a brown field. Simple use cases and projects, such as metrics, dashboards, reports, and historical analyses, help you understand the business model and assess where your contribution to the company's success lies. This is the key to data science success, not only in multimetal but everywhere else as well.
Mahboobeh Zohourian
Hormozgan University, Iran
Title: Discovering the Dropout Situations Using Statistical and Machine Learning Models
Biography:
Mahboobeh Zohourian Moftakhar Ahmadi completed her Master of Statistics at the age of 27 at Ferdowsi University. She is a lecturer at the Education Office of Mashhad and has published one paper in a reputed conference.
Abstract:
Dropping out of university is one of the serious issues of higher education, in both the public and the private sector, notably in non-profit universities where students must pay tuition fees. Moreover, in the state universities, where the Ministry of Science, Research and Technology pays the per-capita cost of each student, dropout imposes economic losses on the government and the higher education system. This study aims to determine and classify the factors influencing student dropout using statistical and machine learning models, and then to identify and predict dropout situations. To this end, the Hormozgan University Educational System Database, containing information on 6,915 students at different educational levels between 2011 and 2015, was used. The data were analyzed using statistical learning models such as decision trees (the base decision tree, the random forest model, and boosting), logistic regression, and machine learning models such as neural networks and support vector machines.
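A hedged sketch of this modeling step with scikit-learn; the synthetic, imbalanced data and the model settings are placeholders for the study's actual features:

```python
# Compare several of the classifier families named in the abstract on a
# simulated, imbalanced dropout-style dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=6915, n_features=10, weights=[0.85],
                           random_state=0)   # imbalanced, like dropout data
models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting":      GradientBoostingClassifier(random_state=0),
    "logistic":      LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```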
Zolo Kiala
University of KwaZulu-Natal, Pietermaritzburg, South Africa
Title: Automated classification of a tropical landscape infested by Parthenium weed (Parthenium hysterophorus)
Biography:
Zolo Kiala received the M.Sc. degree (cum laude) in Science from the University of KwaZulu-Natal, Pietermaritzburg, South Africa, and is currently pursuing the Ph.D. degree, specializing in mapping and monitoring invasive alien plants. His research interests include hyperspectral and multispectral remote sensing applications in rangeland ecology, natural vegetation, and field crops.
Abstract:
The invasive Parthenium weed (Parthenium hysterophorus) adversely affects animal and human health, agricultural productivity, rural livelihoods, local and national economies, and the environment. Its fast-spreading capability requires consistent monitoring, potentially through remote sensing, so that relevant mitigation approaches can be adopted. To date, studies that have endeavoured to map the Parthenium weed have commonly used popular classification algorithms such as support vector machines and random forest classifiers, which do not capture the complex structural characteristics of the weed. Furthermore, determining site- or data-specific algorithms, often achieved through intensive comparison of algorithms, is laborious and time-consuming, and the selected algorithms may not be optimal on datasets collected at other sites. Hence, this study adopted the Tree-based Pipeline Optimization Tool (TPOT), an automated machine learning approach that can overcome high data variability during the classification process. Using Sentinel-2 and Landsat 8 imagery to map Parthenium weed, we compared the outcome of TPOT to the best-performing, optimized algorithm selected from sixteen classifiers on different training datasets. Results showed that the TPOT model yielded higher overall classification accuracy (88.15% using Sentinel-2 and 74% using Landsat 8) than the commonly used robust classifiers. This study is the first to demonstrate the value of TPOT in mapping Parthenium weed infestations using satellite imagery. Its adoption would therefore be useful in limiting human intervention while optimizing classification accuracies for mapping invasive plants. Based on these findings, we propose TPOT as an efficient method for selecting and tuning algorithms for Parthenium discrimination and monitoring, and indeed for general vegetation mapping.
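For readers unfamiliar with TPOT, a minimal usage sketch follows; the synthetic data stands in for the study's band features and training labels:

```python
# TPOT evolves a full scikit-learn pipeline (preprocessing + model +
# hyperparameters) automatically, which is the "automated machine
# learning" step the abstract describes.
from tpot import TPOTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# stand-in for pixel spectra (e.g. Sentinel-2 bands) and weed/non-weed labels
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=0, verbosity=2)
tpot.fit(X_train, y_train)                 # evolves and trains the best pipeline
print(tpot.score(X_test, y_test))
tpot.export("parthenium_pipeline.py")      # best pipeline as plain sklearn code
```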
Yuefeng Li
Queensland University of Technology, Australia
Title: AI-based data analysis for text classification and document summarization
Biography:
Yuefeng Li is a Professor at Queensland University of Technology (QUT), Australia. Since the award of his PhD in 2001, he has made significant contributions to data mining and text mining, publishing over 200 refereed papers, including seven best paper awards. He has demonstrable experience in leading large-scale research projects and has achieved many established research outcomes that are highly cited in top journals (e.g., TKDE, DMKD) and international conferences (e.g., KDD, ICIS, CIKM, ICDM, WWW, and Hypertext). His total number of Google citations is 4,729, with an h-index of 33. He is the Editor-in-Chief of the Web Intelligence journal.
Abstract:
Over the years, businesses have collected very large and complex big data collections, and it has become increasingly difficult to process them using traditional techniques. A big challenge is that the majority of big data is unlabelled and unstructured (not organized in a pre-defined manner). Recently, AI (artificial intelligence) based techniques have been used to address this issue, e.g., understanding a firm's reputation from online customer reviews, or retrieving training samples from unlabelled tweets. This talk discusses how AI techniques contribute to text classification and document summarization when only limited user feedback about relevance is available. It first discusses the principle of a new classification methodology, three-way decision based binary classification, to address the hard issue of the uncertain boundary between the positive and negative classes. It then extends the application of three-way decisions from text classification to document summarization and sentiment analysis. The talk presents new experimental results on several popular data collections, such as RCV1, Reuters-21578, Tweets2011 and Tweets2013, DUC 2006 and 2007, and the Amazon review data collections. It also discusses advanced techniques for obtaining more knowledge about relevance from big data, in order to help people create effective machine learning systems for processing big data, and several open issues regarding AI-based data analysis for text, Web, and media data.
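A hedged sketch of the three-way decision idea described above; the thresholds and the underlying probability scorer are illustrative assumptions, not the speaker's published method:

```python
# Three-way decision classification: accept as positive above an upper
# threshold, reject as negative below a lower threshold, and defer the
# uncertain boundary region in between for further processing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

alpha, beta = 0.7, 0.3                       # upper / lower thresholds (assumed)
decision = np.full(len(scores), "boundary", dtype=object)
decision[scores >= alpha] = "positive"
decision[scores <= beta] = "negative"
for region in ("positive", "boundary", "negative"):
    print(region, (decision == region).sum())
```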