This paper proposes a novel approach for identifying software bugs that builds on a meaningful combination of word embeddings, graph-based text representations and graph attention networks. Existing approaches aim to advance each of the above components individually, without considering an integrative approach; as a result, they ignore information related either to the structure of a given text or to its individual words. Instead, our approach seamlessly incorporates both semantic and structural characteristics into a graph, which is then fed to a graph attention network in order to classify GitHub issues as bugs or features. Our experimental results demonstrate a significant improvement in terms of accuracy, precision and recall compared to a list of classical and graph-based machine learning models. The dataset for the experiments reported in this paper has been retrieved from the kaggle.com platform and concerns GitHub issues with short-text attributes.
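To make the described pipeline more concrete, the following is a minimal sketch, not the authors' implementation, of classifying issue graphs with a graph attention network in PyTorch Geometric; node features are assumed to be word embeddings, and all class and parameter names are illustrative assumptions.

```python
# Sketch only: a GAT that pools the nodes of each issue graph and outputs
# "bug" vs. "feature" logits. Assumes node features (e.g., word embeddings),
# an edge_index built from the graph-based text representation, and a batch
# vector mapping nodes to issue graphs are prepared elsewhere.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class IssueGAT(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=64, num_classes=2, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.gat2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)
        self.classifier = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.elu(self.gat1(x, edge_index))
        x = F.elu(self.gat2(x, edge_index))
        return self.classifier(global_mean_pool(x, batch))  # bug/feature logits
```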
Conference
Predicting prices of Airbnb listings via Graph Neural Networks and Document Embeddings: The case of the island of Santorini
Nikos Kanakaris and Nikos Karacapilidis
International Conference on ENTERprise Information Systems (CENTERIS) (to appear) 2022
We propose a new approach for predicting prices of Airbnb listings in touristic destinations such as the island of Santorini using graph neural networks and document embeddings. Existing methods rely only on the features of each individual listing, ignoring any topological or neighborhood properties. Our approach represents the listings of a given area as a graph, where each node corresponds to a listing and each edge connects two similar neighboring listings. This enables us not only to exploit the features of each individual listing, but also to take into consideration information related to its neighborhood. Our preliminary experiments demonstrate that the proposed approach outperforms a list of classical regression models as far as the coefficient of determination (R2) is concerned, and decreases the Mean Squared Error (MSE). The data for the experiments reported in this paper have been retrieved from the insideairbnb.com platform and describe the Airbnb listings of the island of Santorini.
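As a rough illustration of the idea rather than the paper's actual code, one might connect each listing to its most similar neighbours and regress prices with a small graph neural network; the feature matrix, the value of k and the model sizes below are assumptions.

```python
# Sketch only: build a k-nearest-neighbour similarity graph over listing
# features and predict a price per listing (node) with a two-layer GCN.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.neighbors import kneighbors_graph
from torch_geometric.nn import GCNConv

def build_listing_graph(features: np.ndarray, k: int = 5) -> torch.Tensor:
    adj = kneighbors_graph(features, n_neighbors=k, mode="connectivity")
    row, col = adj.nonzero()
    return torch.tensor(np.vstack([row, col]), dtype=torch.long)  # edge_index

class PriceGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=32):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, 1)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index).squeeze(-1)  # predicted price per node
```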
Book Chapter
A comparative survey of graph databases and software for social network analytics: The link prediction perspective
Nikos Kanakaris, Dimitrios Michail, and Iraklis Varlamis
Book chapter for Graph Databases and their use in social media and smart cities, Science Publishers and CRC Press, Taylor and Francis Group (to appear) 2022
In recent years, we have witnessed an excessive increase in the amounts of data available on the Web. These data originate mostly from social media applications or social networks and are thus highly connected. Graph databases are capable of managing such data successfully, since they are specifically designed for storing, retrieving, and searching data that is rich in relationships. This chapter aims to provide a detailed literature review of the existing graph databases and software libraries suitable for performing common social network analytics tasks. In addition, a classification of these graph technologies is proposed, taking into consideration (i) the provided means of storing, importing, exporting, and querying data, (ii) the available algorithms, (iii) the ability to deal with big social graphs, and (iv) the CPU and memory usage of each of the reported technologies.
Journal
Detection of fake news campaigns using graph convolutional networks
Dimitrios Michail, Nikos Kanakaris, and Iraklis Varlamis
International Journal of Information Management Data Insights, Elsevier 2022
The detection of organised disinformation campaigns that spread fake news by first camouflaging it as real is crucial in the battle against misinformation and disinformation in social media. This article presents a method for classifying the diffusion graphs of news formed in social media, by taking into account the profiles of the users that participate in the graph, the profiles of their social relations and the way the news spreads, while ignoring the actual text content of the news or the messages that spread it. This increases the robustness of the method and widens its applicability in different contexts. The results of this study show that the proposed method outperforms methods that rely on textual information only and provides a model that can be employed for detecting similar disinformation campaigns in different contexts in the same social medium.
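A minimal sketch of this setting, under the assumption that each news item yields one diffusion graph whose node features are user-profile vectors, could look as follows; the model and its dimensions are illustrative, not the article's implementation.

```python
# Sketch only: classify a whole diffusion graph (fake-news campaign vs. real)
# from user-profile node features and the spread structure, with no text input.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class DiffusionGCN(torch.nn.Module):
    def __init__(self, profile_dim, hidden_dim=64, num_classes=2):
        super().__init__()
        self.conv1 = GCNConv(profile_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.out = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return self.out(global_mean_pool(x, batch))  # campaign vs. genuine logits
```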
Journal
Making personnel selection smarter through word embeddings: A graph-based approach
Nikos Kanakaris, Nikolaos Giarelis, Ilias Siachos, and Nikos Karacapilidis
This paper employs techniques and algorithms from the fields of natural language processing, graph representation learning and word embeddings to assist project managers in the task of personnel selection. To do so, our approach initially represents multiple textual documents as a single graph. Then, it computes word embeddings through representation learning on graphs and performs feature selection. Finally, it builds a classification model that is able to estimate how qualified a candidate employee is to work on a given task, taking as input only the descriptions of the tasks and a list of word embeddings. Our approach differs from existing ones in that it does not require the calculation of key performance indicators or any other form of structured data in order to operate properly. For our experiments, we retrieved data from the Jira issue tracking system of the Apache Software Foundation. The evaluation results show, in most cases, an increase of 0.43% in the accuracy of the proposed classification models when compared against a widely adopted baseline method, while their validation loss is significantly decreased, by 65.54%.
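The embedding step can be sketched roughly as follows, assuming the multi-document graph is available as a networkx graph; the random-walk plus Word2Vec recipe and the averaging of embeddings per task description are illustrative simplifications, not the paper's exact method.

```python
# Sketch only: learn word embeddings via random walks over a word graph,
# then average them per task description and fit a scikit-learn classifier.
import random
import numpy as np
import networkx as nx
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

def random_walks(graph: nx.Graph, walks_per_node: int = 10, walk_length: int = 20):
    walks = []
    for node in graph.nodes():
        for _ in range(walks_per_node):
            walk = [node]
            for _ in range(walk_length - 1):
                neighbours = list(graph.neighbors(walk[-1]))
                if not neighbours:
                    break
                walk.append(random.choice(neighbours))
            walks.append([str(n) for n in walk])
    return walks

def embed_words(graph: nx.Graph, dim: int = 64):
    return Word2Vec(random_walks(graph), vector_size=dim, window=5, min_count=1).wv

def task_vector(tokens, wv, dim: int = 64):
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# clf = LogisticRegression().fit([task_vector(t, wv) for t in task_tokens], labels)
```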
Book Chapter
Medical Knowledge Graphs in the Discovery of Future Research Collaborations
Nikolaos Giarelis, Nikos Kanakaris, and Nikos Karacapilidis
This chapter introduces a framework that is based on a novel graph-based text representation method and combines graph-based feature selection, text categorization and link prediction to advance the discovery of future research collaborations. Our approach integrates into a single knowledge graph both structured and unstructured textual data through a novel representation of multiple scientific documents. The Neo4j graph database is used for the representation of the proposed scientific knowledge graph. For the implementation of our approach, we use the Python programming language and the scikit-learn machine learning library. We assess our approach against classical link prediction algorithms using accuracy, recall and precision as our performance metrics. Our experiments achieve state-of-the-art accuracy in the task of predicting future research collaborations. The experiments reported in this chapter use the COVID-19 Open Research Dataset.
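As an illustration of how part of such a knowledge graph might be loaded into Neo4j with the official Python driver (5.x API assumed), consider the following sketch; the node labels, relationship types and properties are hypothetical, not the chapter's actual schema.

```python
# Sketch only: store papers, authors and abstract words as nodes, connected
# by WRITES and INCLUDES relationships (hypothetical schema).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_paper(tx, title, authors, words):
    tx.run("MERGE (p:Paper {title: $title})", title=title)
    for name in authors:
        tx.run(
            "MERGE (a:Author {name: $name}) "
            "WITH a MATCH (p:Paper {title: $title}) MERGE (a)-[:WRITES]->(p)",
            name=name, title=title,
        )
    for word in words:
        tx.run(
            "MERGE (w:Word {value: $value}) "
            "WITH w MATCH (p:Paper {title: $title}) MERGE (p)-[:INCLUDES]->(w)",
            value=word, title=title,
        )

with driver.session() as session:
    session.execute_write(load_paper, "A sample paper", ["A. Author"], ["graph", "collaboration"])
```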
2021
Journal
Converting Biomedical Text Annotated Resources into FAIR Research Objects with an Open Science Platform
Alexandros Kanterakis, Nikos Kanakaris, Manos Koutoulakis, Konstantina Pitianou, Nikos Karacapilidis, Lefteris Koumakis, and George Potamias
Today, there are excellent resources for the semantic annotation of biomedical text. These resources range from ontologies and NLP tools to annotators and web services. Most of them are available either as open-source components (e.g., MetaMap) or as web services that offer free access (e.g., Whatizit). In order to use these resources in automatic text annotation pipelines, researchers face significant technical challenges. For open-source tools, the challenges include setting up the computational environment, resolving dependencies, and compiling and installing the software. For web services, the challenge is implementing clients that communicate with the respective web APIs. Even resources that are available as Docker containers (e.g., the NCBO annotator) require significant technical skills for installation and setup. This work deals with the task of creating ready-to-install and ready-to-run Research Objects (ROs) for a large collection of components in biomedical text analysis. These components include (a) tools such as cTAKES, NOBLE Coder, MetaMap, NCBO annotator, BeCAS, and Neji; (b) ontologies from BioPortal, NCBI BioSystems, and Open Biomedical Ontologies; and (c) text corpora such as BC4GO, the Mantra Gold Standard Corpus, and the COVID-19 Open Research Dataset. We make these resources available in OpenBio.eu, an open-science RO repository and workflow management system. All ROs can be searched, shared, edited, downloaded, commented on, and rated. We also demonstrate how one can easily connect these ROs to form a large variety of text annotation pipelines.
Journal
Shall I Work with Them? A Knowledge Graph-Based Approach for Predicting Future Research Collaborations
Nikos Kanakaris, Nikolaos Giarelis, Ilias Siachos, and Nikos Karacapilidis
We consider the prediction of future research collaborations as a link prediction problem applied on a scientific knowledge graph. To the best of our knowledge, this is the first work on the prediction of future research collaborations that combines structural and textual information of a scientific knowledge graph through a purposeful integration of graph algorithms and natural language processing techniques. Our work: (i) investigates whether the integration of unstructured textual data into a single knowledge graph affects the performance of a link prediction model, (ii) studies the effect of previously proposed graph kernel-based approaches on the performance of an ML model, as far as the link prediction problem is concerned, and (iii) proposes a three-phase pipeline that enables the exploitation of structural and textual information, as well as of pre-trained word embeddings. We benchmark the proposed approach against classical link prediction algorithms using accuracy, recall, and precision as our performance metrics. Finally, we empirically test our approach through various feature combinations with respect to the link prediction problem. Our experiments with the new COVID-19 Open Research Dataset demonstrate a significant improvement in the abovementioned performance metrics in the prediction of future research collaborations.
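The combination of structural and textual information can be sketched as follows for a single candidate author pair; the chosen structural features and the classifier are assumptions for illustration, not the paper's exact three-phase pipeline.

```python
# Sketch only: concatenate classical link-prediction features of an author
# pair with the cosine similarity of their text embeddings, then train a
# scikit-learn classifier on labelled pairs.
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(graph: nx.Graph, u, v, emb: dict) -> list:
    common = len(list(nx.common_neighbors(graph, u, v)))
    pref_attach = graph.degree(u) * graph.degree(v)
    jaccard = next(nx.jaccard_coefficient(graph, [(u, v)]))[2]
    cosine = float(np.dot(emb[u], emb[v]) /
                   (np.linalg.norm(emb[u]) * np.linalg.norm(emb[v]) + 1e-9))
    return [common, pref_attach, jaccard, cosine]

# X = [pair_features(g, u, v, embeddings) for (u, v) in candidate_pairs]
# y = [1 if (u, v) in future_collaborations else 0 for (u, v) in candidate_pairs]
# model = LogisticRegression().fit(X, y)
```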
Conference
A Comparative Assessment of State-Of-The-Art Methods for Multilingual Unsupervised Keyphrase Extraction
Nikolaos Giarelis, Nikos Kanakaris, and Nikos Karacapilidis
In Artificial Intelligence Applications and Innovations 2021
Keyphrase extraction is a fundamental task in information management, which is often used as a preliminary step in various information retrieval and natural language processing tasks. The main contribution of this paper lies in providing a comparative assessment of prominent multilingual unsupervised keyphrase extraction methods that build on statistical (RAKE, YAKE), graph-based (TextRank, SingleRank) and deep learning (KeyBERT) techniques. For the experiments reported in this paper, we employ well-known datasets designed for keyphrase extraction from five different natural languages (English, French, Spanish, Portuguese and Polish). We use the F1 score and a partial-match evaluation framework, aiming to investigate whether the number of terms of the documents and the language of each dataset affect the accuracy of the selected methods. Our experimental results reveal a set of insights about the suitability of the selected methods for texts of different sizes, as well as about the performance of these methods on datasets of different languages.
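For readers who want to reproduce the general setup, three of the compared method families can be invoked as in the sketch below (library APIs as commonly documented; this is not the paper's evaluation harness and omits the partial-match F1 computation).

```python
# Sketch only: run YAKE (statistical), RAKE (statistical) and KeyBERT (deep
# learning) on the same document and collect their top keyphrases.
import yake
from rake_nltk import Rake
from keybert import KeyBERT

text = "Graph attention networks combine word embeddings with structural information."

yake_kws = [kw for kw, _ in yake.KeywordExtractor(lan="en", top=10).extract_keywords(text)]

rake = Rake()
rake.extract_keywords_from_text(text)
rake_kws = rake.get_ranked_phrases()[:10]

bert_kws = [kw for kw, _ in KeyBERT().extract_keywords(text, top_n=10)]

print(yake_kws, rake_kws, bert_kws, sep="\n")
```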
2020
Conference
On the Utilization of Structural and Textual Information of a Scientific Knowledge Graph to Discover Future Research Collaborations: A Link Prediction Perspective
Nikolaos Giarelis, Nikos Kanakaris, and Nikos Karacapilidis
We consider the discovery of future research collaborations as a link prediction problem applied on scientific knowledge graphs. Our approach integrates into a single knowledge graph both structured and unstructured textual data through a novel representation of multiple scientific documents. The Neo4j graph database is used for the representation of the proposed scientific knowledge graph. For the implementation of our approach, we use the Python programming language and the scikit-learn ML library. We benchmark our approach against classical link prediction algorithms using accuracy, recall, and precision as our performance metrics. Our initial experiments demonstrate a significant improvement in the accuracy of the future collaboration prediction task. The experiments reported in this paper use the new COVID-19 Open Research Dataset.
Conference
An Innovative Graph-Based Approach to Advance Feature Selection from Multiple Textual Documents
Nikolaos Giarelis, Nikos Kanakaris, and Nikos Karacapilidis
In Artificial Intelligence Applications and Innovations 2020
This paper introduces a novel graph-based approach to select features from multiple textual documents. The proposed solution enables the investigation of the importance of a term within a whole corpus of documents by utilizing contemporary graph theory methods, such as community detection algorithms and node centrality measures. Compared to well-tried existing solutions, evaluation results show that the proposed approach increases the accuracy of most of the text classifiers employed and decreases the number of features required to achieve ‘state-of-the-art’ accuracy. Well-known datasets used for the experiments reported in this paper include 20Newsgroups, LingSpam, Amazon Reviews and Reuters.
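A minimal sketch of this kind of graph-based feature selection, assuming a word co-occurrence graph is already built with networkx and using PageRank as the centrality measure, could look as follows; the per-community selection rule is an assumption for illustration.

```python
# Sketch only: detect communities of terms and keep the most central terms
# of each community as the feature set of a downstream text classifier.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def select_features(word_graph: nx.Graph, per_community: int = 20) -> list:
    centrality = nx.pagerank(word_graph)
    selected = []
    for community in greedy_modularity_communities(word_graph):
        ranked = sorted(community, key=lambda w: centrality[w], reverse=True)
        selected.extend(ranked[:per_community])
    return selected
```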
Conference
On a Novel Representation of Multiple Textual Documents in a Single Graph
Nikolaos Giarelis, Nikos Kanakaris, and Nikos Karacapilidis
This paper introduces a novel approach to represent multiple documents as a single graph, namely the graph-of-docs model, together with an associated novel algorithm for text categorization. The proposed approach enables the investigation of the importance of a term within a whole corpus of documents and supports the inclusion of relationship edges between documents, thus enabling the calculation of important document-level metrics. Compared to well-tried existing solutions, our initial experiments demonstrate a significant improvement in the accuracy of the text categorization process. For the experiments reported in this paper, we used a well-known dataset containing about 19,000 documents organized in various subjects.
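A simplified sketch of such a multi-document graph, with assumed edge semantics (word co-occurrence within a window, document-word inclusion, and document-document similarity via shared terms), is given below; it illustrates the general idea rather than the exact graph-of-docs construction.

```python
# Sketch only: build a single networkx graph over word and document nodes.
import itertools
import networkx as nx

def graph_of_docs(docs: dict, window: int = 3) -> nx.Graph:
    g = nx.Graph()
    for doc_id, tokens in docs.items():
        g.add_node(doc_id, kind="document")
        for i, token in enumerate(tokens):
            g.add_node(token, kind="word")
            g.add_edge(doc_id, token, kind="includes")
            for other in tokens[i + 1 : i + window]:
                g.add_edge(token, other, kind="connects")  # co-occurrence edge
    for d1, d2 in itertools.combinations(docs, 2):
        shared = set(docs[d1]) & set(docs[d2])
        if shared:
            g.add_edge(d1, d2, kind="is_similar", weight=len(shared))
    return g
```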
Conference
On the Exploitation of Textual Descriptions for a Better-informed Task Assignment Process
Nikos Kanakaris, Nikos Karacapilidis, and Georgios Kournetas
Project Management is a complex practice associated with a series of challenges, such as the handling of conflicts and dependencies in resource allocation, the fine-tuning of projects to avoid fragmented planning, the handling of potential opportunities or threats during the execution of a project, and the alignment between projects and business objectives. Traditionally, methods and tools to address these issues are based on analytical approaches developed in the realm of the Operations Research discipline. Aiming to facilitate and augment the quality of the Project Management practice, this paper proposes a hybrid approach that builds on the synergy between contemporary Machine Learning and Operations Research techniques. Based on past data, Machine Learning techniques can predict undesired situations, provide timely warnings and recommend preventive actions regarding problematic resource loads or deviations from business priority lists. The applicability of our approach is demonstrated through two real examples that elaborate on two different datasets. In these examples, we comment on the proper orchestration of the associated Operations Research and Machine Learning algorithms, paying equal attention to both optimization and big-data manipulation issues.
2019
Conference
Towards Reproducible Bioinformatics: The OpenBio-C Scientific Workflow Environment
A. Kanterakis, G. Iatraki, K. Pityanou, L. Koumakis, N. Kanakaris, N. Karacapilidis, and G. Potamias
In 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE), Oct 2019
Plant life all over the world is decreasing rapidly, while outbreaks of forest fires are increasing. As a result, many remote sensing programs and missions have been created with the goal of collecting data to map out burnt areas. NASA’s Landsat Program provides information that can be used for ’burn-scar’ mapping. The European Space Agency (ESA) also offers its services via the Copernicus programme, which is responsible for the satellite missions called ’Sentinels’. The National Observatory of Athens has played a significant role in mapping out burnt areas throughout Greek territory by developing systems that apply digital image processing algorithms and filters. The data used come mostly from the Landsat Program, whose satellite images are of high quality and large in size. The need to process a large number of such large images makes a sequential implementation of the system insufficient in terms of execution time. This thesis parallelizes the algorithms and filters of the ’burn-scar’ mapping system implemented by the National Observatory of Athens. The Python programming language and the Message Passing Interface (MPI) are used for the implementation. The parallelization decreases the total execution time from 14.7 minutes to less than 1 minute, using up to 33 quad-core computers at the laboratory of the Department of Informatics and Telematics.
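The described parallelisation can be sketched with mpi4py as follows; the file layout and the process_image function are hypothetical placeholders, not the actual system of the National Observatory of Athens.

```python
# Sketch only: the root rank lists the satellite images and scatters roughly
# equal shares to all ranks; each rank runs the filtering pipeline on its share.
from mpi4py import MPI
import glob

def process_image(path):
    ...  # placeholder for the burn-scar filters applied to a single image

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    paths = sorted(glob.glob("images/*.tif"))
    chunks = [paths[i::size] for i in range(size)]
else:
    chunks = None

my_paths = comm.scatter(chunks, root=0)
results = [process_image(p) for p in my_paths]
all_results = comm.gather(results, root=0)  # collected at rank 0
```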