📀 Luca Coraggio

I am a CSEF Fellow and a Postdoctoral Researcher at the Department of Economics and Statistics, University of Naples Federico II (Italy). I hold a Ph.D. in Economics, and my main research interests are in Machine Learning and Statistics.

In recent years, I have been working on two connected lines of research. The first is on methodological statistics, with a particular focus on model-based clustering and criteria for selecting optimal clustering solutions. The second is devoted to the application of supervised and unsupervised learning methods to general problems in Economics, using state-of-the-art statistical methods (standard ML tools, deep learning, NLP, and computer vision) to exploit new sources of data, such as images and text.

I enjoy coding my own solutions, and I am fluent in several programming languages. My current top three are C, Python, and R.

Preference learning

Bayesian genome-wide clustering and variable selection of transcriptomic data via rank-based mixtures.

This is joint work with Prof Valeria Vitelli and other amazing researchers from the University of Oslo, Norway. In this work, we extend the Bayesian Mallows Model to handle clustering in ultra-high-dimensional settings. A preprint of the paper will be available on arXiv soon.

Mirkin distance and cluster validation

Together with Prof Boris Mirkin (HSE) and Prof Antonio D'Ambrosio (Federico II), I am studying the asymptotic properties of the Mirkin distance between partitions and developing applications of our new results to cluster analysis and cluster validation.
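
For readers unfamiliar with the Mirkin distance, here is a minimal Python sketch of how it can be computed for two partitions given as label vectors. It uses the common unnormalized sum-of-squares form, which equals twice the number of item pairs grouped together in one partition but separated in the other. The function name `mirkin_distance` is illustrative, and normalization conventions vary across the literature, so this is not necessarily the exact form used in the project.

```python
from collections import Counter


def mirkin_distance(labels_a, labels_b):
    """Unnormalized Mirkin distance between two partitions of the same items.

    Computed from contingency counts as
        sum_i n_i.^2 + sum_j n_.j^2 - 2 * sum_ij n_ij^2,
    which is twice the number of unordered pairs on which the
    two partitions disagree.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must cover the same items")

    def sum_of_squares(counter):
        return sum(count * count for count in counter.values())

    rows = Counter(labels_a)                  # cluster sizes in partition A
    cols = Counter(labels_b)                  # cluster sizes in partition B
    cells = Counter(zip(labels_a, labels_b))  # joint cluster sizes

    return sum_of_squares(rows) + sum_of_squares(cols) - 2 * sum_of_squares(cells)


if __name__ == "__main__":
    print(mirkin_distance([0, 0, 1, 1], [0, 0, 1, 1]))
    print(mirkin_distance([0, 0, 1, 1], [0, 1, 0, 1]))
```

The distance depends only on which items are grouped together, so relabeling the clusters of either partition leaves it unchanged.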

Skill Mismatch and Job misallocation

Coraggio, L., Langella, M., Miano, A., Pagano, M., Petterson, M. S., Pezone, V., & Scognamiglio, A. (2024). Mismatch in the 21st century: an overview. CSEF.

Selected publications (full list)

(2025) JAQ of all trades: Job mismatch, firm productivity and managerial quality.
Journal of Financial Economics. -- w/ Marco Pagano and Annalisa Scognamiglio and Joacim Tåg

We develop a novel measure of job-worker allocation quality (JAQ) by exploiting employer-employee data with machine learning techniques. Based on our measure, the quality of job-worker matching correlates positively with individual labor earnings and firm productivity, as well as with market competition, non-family firm status, and employees’ human capital. Management plays a key role in job-worker matching: when managerial hirings and firings persistently raise management quality, the matching of rank-and-file workers to their jobs improves. JAQ can be constructed from any employer–employee data set including workers’ occupations, and used to explore research questions in corporate finance and organization economics.

@Article{CoraggioEtAl2025JoFE,
  author   = {Luca Coraggio and Marco Pagano and Annalisa Scognamiglio and Joacim Tåg},
  journal  = {Journal of Financial Economics},
  title    = {JAQ of all trades: Job mismatch, firm productivity and managerial quality},
  year     = {2025},
  issn     = {0304-405X},
  pages    = {103992},
  volume   = {164},
  abstract = {We develop a novel measure of job-worker allocation quality (JAQ) by exploiting employer-employee data with machine learning techniques. Based on our measure, the quality of job-worker matching correlates positively with individual labor earnings and firm productivity, as well as with market competition, non-family firm status, and employees’ human capital. Management plays a key role in job-worker matching: when managerial hirings and firings persistently raise management quality, the matching of rank-and-file workers to their jobs improves. JAQ can be constructed from any employer–employee data set including workers’ occupations, and used to explore research questions in corporate finance and organization economics.},
  doi      = {10.1016/j.jfineco.2024.103992},
  keywords = {Jobs, Workers, Matching, Mismatch, Machine learning, Productivity, Management},
  url      = {https://www.sciencedirect.com/science/article/pii/S0304405X24002150},
}

(2024) Asymptotic Results for the Estimation of the Quadratic Score of a Clustering.
Mathematics. -- w/ Pietro Coretto

In cluster analysis one often finds several partitions of a data set using different clustering methods and algorithms set with a variety of hyperparameters and tunings. The number of clusters K is one of the most relevant of such hyperparameters. Cluster selection is the task of choosing the desired partitions. The Bootstrap Quadratic Scoring is a recently introduced method where the cluster selection is performed by optimizing a score attached to a partition that is based on the quadratic discriminant function. Previously, we proposed the estimation of this cluster score via bootstrap resampling and investigated the proposed estimator based on numerical experiments and real data applications. However, that earlier work did not provide theoretical guarantees. In this paper, we fill that gap. We study the asymptotic behavior of the scoring method and show that the proposed estimator converges to well-defined population counterparts.

@Article{CoraggioCoretto2024M,
  author   = {Coraggio, Luca and Coretto, Pietro},
  journal  = {Mathematics},
  title    = {Asymptotic Results for the Estimation of the Quadratic Score of a Clustering},
  year     = {2024},
  issn     = {2227-7390},
  number   = {21},
  volume   = {12},
  abstract = {In cluster analysis one often finds several partitions of a data set using different clustering methods and algorithms set with a variety of hyperparameters and tunings. The number of clusters K is one of the most relevant of such hyperparameters. Cluster selection is the task of choosing the desired partitions. The Bootstrap Quadratic Scoring is a recently introduced method where the cluster selection is performed by optimizing a score attached to a partition that is based on the quadratic discriminant function. Previously, we proposed the estimation of this cluster score via bootstrap resampling and investigated the proposed estimator based on numerical experiments and real data applications. However, that earlier work did not provide theoretical guarantees. In this paper, we fill that gap. We study the asymptotic behavior of the scoring method and show that the proposed estimator converges to well-defined population counterparts.},
  doi      = {10.3390/math12213417},
  keywords = {cluster validation; model-selection; method-selection; resampling methods; asymptotic analysis},
  url      = {https://www.mdpi.com/2227-7390/12/21/3417},
}

(2023) Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score.
Journal of Multivariate Analysis. -- w/ Pietro Coretto

Cluster analysis requires fixing the number of clusters and often many hyper-parameters. In practice, one produces several partitions, and a final one is chosen based on validation or selection criteria. There exist an abundance of validation methods that, implicitly or explicitly, assume a certain clustering notion. In this paper, we focus on groups that can be well separated by quadratic or linear boundaries. The reference cluster concept is defined through the quadratic discriminant function and parameters describing clusters’ size, center and scatter. We develop two cluster-quality criteria that are consistent with groups generated from a class of elliptic–symmetric distributions. Using the bootstrap resampling of the proposed criteria, we propose a selection rule that allows choosing among many clustering solutions, eventually obtained from different methods. Extensive experimental analysis shows that the proposed methodology achieves a better overall performance compared to established alternatives from the literature.

@Article{CoraggioCoretto2023JoMA,
  author    = {Luca Coraggio and Pietro Coretto},
  journal   = {Journal of Multivariate Analysis},
  title     = {Selecting the number of clusters, clustering models, and algorithms. A unifying approach based on the quadratic discriminant score},
  year      = {2023},
  month     = jul,
  pages     = {105181},
  volume    = {196},
  doi       = {10.1016/j.jmva.2023.105181},
  publisher = {Elsevier {BV}},
}

(2021) Illicit drugs seizures in 2013–2018 and characteristics of the illicit market within the Neapolitan area.
Forensic Science International. -- w/ Silvestre, A. and Basilicata, P. and Guadagni, R. and Simonelli, A. and Pieri, M.

The study presents results of toxicological analysis performed on seized material in the Neapolitan area in the period from 2013 to 2018. A constancy in THC and heroin percentages is evidenced (%THC ~10% and ~11.5% for marijuana and hashish; heroin: 20–24%), with mean values exceeding the European data. Data on cocaine revealed a constant increment of active principle percentage over the studied period (from 40% in 2013 to ~65% in 2018), with peak of 70% in 2017; also, number of samples exceeding the mean value increased over years. Active principles contents resulted higher than the ones reported in other Italian areas over the same period; marijuana was prevalent on hashish, confirming an Italian trend different from other European countries. A map of the Campania region evidenced two main “storage” districts, one corresponding to the city center and the second located in the northern part. If compared with literature data on the presence of local mafia, these areas are perfectly superimposable to those with the highest risk of homicides, thus confirming the degree of radicalization of local organizations and the relative weight of proceeds from drugs sale. Moreover, such radicalization within the territory seems to be the main reason of the absence of new psychoactive substances among the seized material.

@Article{SilvestreEtAl2021FSI,
  author    = {Silvestre, A. and Basilicata, P. and Coraggio, L. and Guadagni, R. and Simonelli, A. and Pieri, M.},
  journal   = {Forensic Science International},
  title     = {Illicit drugs seizures in 2013–2018 and characteristics of the illicit market within the Neapolitan area},
  year      = {2021},
  issn      = {0379-0738},
  month     = apr,
  pages     = {110738},
  volume    = {321},
  doi       = {10.1016/j.forsciint.2021.110738},
  publisher = {Elsevier BV},
}

QCLUSTER (CRAN).
Performs tuning of clustering models, methods and algorithms, including the problem of determining an appropriate number of clusters. Validation of cluster analysis results is performed via quadratic scoring using resampling methods, as in Coraggio, L. and Coretto, P. (2023).

QCLUSTER is a package for the R programming language. Its core is entirely written in C, leveraging BLAS and LAPACK.

RSC (CRAN).
Performs robust and sparse correlation matrix estimation. Robustness is achieved with a simple robust pairwise correlation estimator, while sparsity is obtained by thresholding. The optimal threshold is tuned via cross-validation. See Serra, Coretto, Fratello and Tagliaferri (2018).
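
To illustrate the sparsity step, here is a minimal Python sketch of thresholding a correlation matrix. It assumes hard thresholding (off-diagonal entries with absolute value below a cutoff are set to zero), which is one common rule and may differ from the rule RSC actually applies. The function name `hard_threshold` and the hand-picked cutoff are illustrative; the robust pairwise estimator and the cross-validated choice of the threshold are not reproduced here.

```python
def hard_threshold(corr, cutoff):
    """Return a copy of `corr` with small off-diagonal entries set to zero.

    `corr` is a square correlation matrix given as a list of lists.
    Entries with |r| < cutoff are zeroed; the diagonal is left untouched.
    Hard thresholding is an assumption made for illustration only.
    """
    size = len(corr)
    return [
        [
            corr[i][j] if i == j or abs(corr[i][j]) >= cutoff else 0.0
            for j in range(size)
        ]
        for i in range(size)
    ]


if __name__ == "__main__":
    estimate = [
        [1.0, 0.8, 0.05],
        [0.8, 1.0, -0.1],
        [0.05, -0.1, 1.0],
    ]
    for row in hard_threshold(estimate, cutoff=0.2):
        print(row)
```

With a cutoff of 0.2, only the 0.8 correlation between the first two variables survives in this toy matrix.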

RSC uses the quicksort algorithm by default to compute the median value of a vector; this is implemented entirely in C. The introsort algorithm is also available, but may not be optimally tuned.

Parallel computing is implemented in R with the doParallel package. The original version used parallel processing with Fortran coarrays. Later, the quicksort function was rewritten in C using pointer arithmetic to improve efficiency.

PythonLab

A short, introductory course in Python programming. Level: undergraduate, 6 lectures, ~2h/lecture. The course is modeled on the Python tutorial, with a tilt toward data analysis and economics. The material is in Italian and is currently available on Moodle Unina (I plan to make it available on git). It includes:

Slides with programming concepts
Exercises and solutions
Scripts and advanced solved exercises

Tools for Data Analysis

MOOC course, hosted on Federica Web Learning, part of the Labor, Development & Policy evaluation program. Available here: link.

The course introduces elements of programming and statistical methods for data analysis and data science. It reviews and uses the R and Python programming languages, as well as shell scripting, to interact with data (visualization, manipulation), automate tasks (web scraping, file management), and deploy machine learning methods. The course is hands-on: students work on mini-projects and practical exercises throughout the lectures; the theory behind the methods is touched upon, and references for self-study are provided. The course is aimed at students who want to acquire programming skills to work with data (ideally, they have already taken statistics courses).
