Luca Canali homepage

Luca.Canali@cern.ch - @LucaCanaliDB - Home Page

Presentations, Talks and Videos:

Spark Performance Lab and Tools: sparkMeasure demo, TPCDS-PySpark demo, Spark-Dashboard demo
Introduction to Apache Spark APIs for Data Processing, training course on Apache Spark, November 2022, PDFs and Videos, Notebooks
Basic Physics Analyses Implemented Using Apache Spark, PyHEP 2022, September 14th, 2022, pptx, PDF, PDF_extended, Notebooks, Video

Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins, Data+AI Summit 2021, May 26th, 2021, pptx, PDF, Video, Demo (mp4)
What is New with Apache Spark Performance Monitoring in Spark 3.0, Data+AI Summit Europe 2020, November 18th, 2020, pptx, PDF, Video
Big Data Tools and Pipelines for Machine Learning in HEP, CERN EP-IT Data science seminar, December 4th, 2019, pptx, PDF
Performance Troubleshooting Using Apache Spark Metrics, Spark Summit Europe 2019, Amsterdam, October 17th, 2019, pptx, PDF, Video
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo, Spark Summit Europe 2019, Amsterdam, October 16th, 2019, pptx, PDF, Video
Big Data In HEP - Physics Data Analysis, Machine learning and Data Reduction at Scale with Apache Spark, IXPUG Annual Conference 2019, CERN, September 24th, 2019, pptx, PDF
Apache Spark for RDBMS Practitioners, Spark Summit Europe 2018, London, October 4th, 2018, pptx, PDF, Video
Data Analytics – Use Cases, Platforms, Services @ CERN IT, ITMM Meeting, CERN, March 5th, 2018, pptx, PDF
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methods, Spark Summit Europe 2017, Dublin, October 26th, 2017, pptx, PDF, Video
Overview of Big Data Solutions and Services at CERN, CERN Knowledge Transfer Forum, CERN, September 29th, 2017, slides: pptx, PDF
Hadoop and Spark Ecosystem for Data Analytics, Experience and Outlook, WLCG GDB meeting, CERN, September 13th, 2017, slides: pptx, PDF
Data Analytics and CERN IT Hadoop Service, CERN openlab Technical Workshop, CERN, December 9th, 2016, slides: pptx, PDF
Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, Spark Summit Europe 2016, Brussels, October 26th, 2016, slides: pptx, PDF, Video
Integration of Oracle and Hadoop: hybrid databases affordable at scale, CHEP 2016, San Francisco, October 11th, 2016, slides: pptx, PDF
Stack Traces and Flame Graphs for Oracle Troubleshooting, UKOUG Tech15 Super Sunday, Birmingham, December 6th, 2015, slides: pptx, PDF
Modern Linux Tools for Oracle Troubleshooting, Swiss Oracle User Group (SOUG) event, Prangins (CH), May 21st, 2015, slides: PDF
Database Services During Run 2, WLCG Collaboration Workshop, Okinawa (JP), April 11th, 2015, slides: pptx
Modern Linux Tools for Oracle Troubleshooting, UKOUG Tech14, Liverpool, December 9th, 2014, slides: pptx, PDF
A Closer Look at CALIBRATE_IO, UKOUG Tech14, Liverpool, December 9th, 2014, slides pptx, PDF
Introduction on Data for Physics at CERN and Deep Dive into Oracle ASM, Enkitec E4 2014, Dallas (TX), June 2014, pptx
A Latency Picture is Worth a Thousand Storage Metrics, Hotsos 2014, Dallas (TX), March 4th, 2014, pptx, pdf
Lost Writes, a DBA's Nightmare?, UKOUG Tech13, Manchester, December 4th, 2013, pptx
Storage Latency for Oracle DBAs, UKOUG Tech13, Manchester, December 2nd, 2013, pptx
Active Data Guard at CERN, UKOUG Conference 2012, Birmingham, December 4th, 2012, pptx
Testing Storage for Oracle RAC 11g with NAS, ASM, and SSD Flash Cache, UKOUG Conference 2011, Birmingham, December 6th, 2011, pptx
CERN IT-DB Deployment, Status, Outlook, ESA-GAIA DB Workshop, ISDC, Geneva, March 2011, pptx
Click here for a list including talks prior to 2011

Repositories at https://github.com/LucaCanali

SparkMeasure

A tool for performance troubleshooting of Apache Spark workloads.

SparkTraining

Training material for course "Introduction to Apache Spark APIs for Data Processing": https://sparktraining.web.cern.ch/

Spark Performance Dashboard

Notes and code for deploying an Apache Spark performance dashboard using container technology (Dockerfile and Helm chart).

SparkPlugins

Code and examples of how to use Spark Plugin extensions with Apache Spark 3.0 to extend the Spark metrics system with custom monitoring probes for OS, I/O and external applications.

SparkDLTrigger

Code, notebooks, and links to the datasets accompanying the article Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

Miscellaneous

Notes on Apache Spark, with tips and techniques on and around using Spark
Spark for Physics, Jupyter notebooks with examples of High Energy Physics analyses using Spark
SparkHistograms, Python and Scala packages for generating histograms with Spark
Performance testing, notes, scripts and resources dedicated to load testing and performance measurements

Jupyter notebooks with examples of how to read from Oracle, Trino/Presto, PostgreSQL, YugabyteDB

Notebook Examples

Example notebooks for Deep Learning, Data Tools, and AI Tools.

Linux tracing scripts

Scripts and tools for troubleshooting and performance analysis in Linux.

PerfSheet4

PerfSheet4 is a tool to query and visualize Oracle AWR data using Excel pivot charts

PerfSheet.js

PerfSheet.js is a tool to extract and visualize Oracle AWR time series data in the browser using JavaScript and dynamic pivot charts.

PyLatencyMap

PyLatencyMap is a tool for heat map visualization on the CLI.

Stack Profiling

Tools and scripts for stack profiling: Userspace, Kernel, OS state and optionally Oracle wait events.

Oracle DBA scripts

A collection of DBA scripts for old-school CLI Oracle troubleshooting and performance monitoring.

OraLatencyMap

OraLatencyMap is a performance widget running on SQL*plus (Oracle's CLI) to collect and visualize latency histograms for Oracle wait events using heat maps.
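
As an illustration of the idea behind the latency heat map tools listed above (PyLatencyMap, OraLatencyMap), here is a minimal Python sketch that renders latency histograms as a text heat map. It is not code from either tool: the function name `render_heatmap`, the character ramp, and the sample data are assumptions made for this example. Each column is one sampling interval, each row is one latency bucket, and denser characters stand for higher counts.

```python
# Hypothetical sketch, not taken from PyLatencyMap or OraLatencyMap.
SHADES = " .:-=+*#%@"  # character ramp, from empty to most intense


def render_heatmap(samples, bucket_labels):
    """Render latency histograms as text lines.

    samples: list of histograms, one per time interval; each histogram is a
             list of event counts, one per latency bucket (lowest bucket first).
    bucket_labels: one label per latency bucket.
    Returns one line per bucket, highest-latency bucket on top.
    """
    peak = max((count for row in samples for count in row), default=0)
    lines = []
    for b in reversed(range(len(bucket_labels))):
        cells = []
        for row in samples:
            count = row[b]
            if peak == 0 or count == 0:
                idx = 0  # blank cell for empty buckets
            else:
                # scale non-zero counts onto ramp positions 1..len(SHADES)-1
                idx = 1 + (count * (len(SHADES) - 2)) // peak
            cells.append(SHADES[idx])
        lines.append(f"{bucket_labels[b]:>6} |" + "".join(cells))
    return lines


if __name__ == "__main__":
    # Made-up counts for three intervals and three latency buckets.
    demo = [[5, 1, 0], [3, 4, 0], [0, 2, 6]]
    for line in render_heatmap(demo, ["1ms", "2ms", "4ms"]):
        print(line)
```

Scaling against the global peak keeps colors comparable across intervals; a real tool would typically also scroll the window as new samples arrive.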

Packages on PyPI

· SparkMeasure - SparkMeasure is a tool for performance troubleshooting of Apache Spark workloads.

o It simplifies the collection and analysis of Spark performance metrics. The bulk of sparkMeasure is written in Scala. This package contains the Python API for sparkMeasure and is intended to work in conjunction with PySpark. Use from PySpark, or in Jupyter notebook environments, or in general as a tool to instrument Spark jobs in your Python code. Link to sparkMeasure GitHub page and documentation

· SparkHistogram - Sparkhistogram contains helper functions for generating data histograms with the Spark DataFrame API.

o Link to SparkHistogram source code and documentation

· TPCDS_PySpark – TPCDS_PySpark is a TPC-DS workload generator written in Python and designed to run at scale using Apache Spark. Use it to build your own Apache Spark Performance Lab, run performance benchmarking and learn about troubleshooting Spark.

· Test_CPU_parallel - Use test_CPU_parallel to generate CPU-intensive load on a system, running multiple threads in parallel.

o The tool runs a CPU-burning loop concurrently on the system, with configurable parallelism. The tool outputs a measurement of the CPU-burning loop execution time as a function of load. Link to test-CPU-parallel source code and documentation
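
The measurement described for Test_CPU_parallel (loop execution time as a function of load) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's source: the names `burn_cpu` and `run_load` and the loop body are made up for this example. It defaults to threads from the standard library; in CPython the global interpreter lock makes threads take turns, so pass `ProcessPoolExecutor` as `executor_cls` to load several cores.

```python
# Hypothetical sketch, not the test_CPU_parallel implementation.
import time
from concurrent.futures import ThreadPoolExecutor


def burn_cpu(iterations):
    """Run a fixed amount of integer arithmetic; return elapsed seconds."""
    start = time.perf_counter()
    acc = 0
    for i in range(iterations):
        acc += (i * i) % 7
    return time.perf_counter() - start


def run_load(parallelism, iterations=200_000, executor_cls=ThreadPoolExecutor):
    """Run `parallelism` copies of the loop concurrently.

    Returns the list of per-worker loop execution times in seconds.
    """
    with executor_cls(max_workers=parallelism) as pool:
        return list(pool.map(burn_cpu, [iterations] * parallelism))


if __name__ == "__main__":
    # Report mean loop time for increasing levels of concurrency.
    for n in (1, 2, 4):
        timings = run_load(n)
        print(f"parallelism={n} mean_loop_time={sum(timings) / len(timings):.4f}s")
```

If the mean loop time stays flat as parallelism grows, the system still has spare CPU capacity; a rising curve indicates contention.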

Images on DockerHub

· Spark-Dashboard – Spark-dashboard is a container image to deploy an Apache Spark performance dashboard. It packages Grafana, InfluxDB, the configuration for ingesting Spark metrics, and prebuilt Grafana dashboards for Spark performance visualization. See the project home at Spark Performance Dashboard

· Test_cpu_parallel – Use test_cpu_parallel to generate CPU-intensive load on Linux, Rust version. See the project home Test_CPU_parallel_Rust

· Test_cpu_parallel.py – Use test_cpu_parallel.py to generate CPU-intensive load on Linux, Python version. See the project home Test_CPU_parallel_Python

Blog at http://externaltable.blogspot.com
and contributing to http://db-blog.web.cern.ch

· Building an Apache Spark Performance Lab: Tools and Techniques for Spark Optimization

· Enhancing Apache Spark Performance with Flame Graphs: A Practical Example Using Grafana Pyroscope

· Performance Comparison of 5 JDKs on Apache Spark

· Building a Semantic Search Engine and RAG Applications with Vector Databases and Large Language Models

· Exploratory Notebooks for Deep Learning and Data Tools: A Beginner's Guide

· CPU Load Testing Exercises: Tools and Analysis for Oracle Database Servers

· Making histograms with Apache Spark and other SQL engines

· Can High Energy Physics Analysis Profit from Apache Spark APIs?

· Apache Spark 3.0 Memory Monitoring Improvements

· Distributed Deep Learning for Physics with TensorFlow and Kubernetes

· Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo

· A Performance Dashboard for Apache Spark

· SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads

· Performance Analysis of a CPU-Intensive Workload in Apache Spark

· Apache Spark and CERN Open Data Analysis, an Example

· Diving into Spark and Parquet Workloads, by Example

· On Measuring Apache Spark Workload Metrics for Performance Troubleshooting

· Spark notes (hosted on GitHub):

o Miscellaneous Spark commands, tips, info

o Spark performance dashboard config details

o Spark workload measurements with sparkMeasure

o Spark executor memory

o Spark and Parquet

o Apache Spark – HBase Connector

o Spark for High Energy Physics

o Spark Histograms

o Flame Graph, tools on Linux for profiling Spark

o Read/analyze Spark EventLog with Spark SQL

o Tools for Linux memory performance measurements

o Spark SQL, a fun UDF example with Mandelbrot set

o Linux OS, CPU, Disk, Network monitoring tools

o Tools for Apache Parquet diagnostics

o MapInArrow for Python UDF

o Spark and OpenSearch

o Example of a Scala project for Spark

· Posters and reports:

o Spark Executors’ Memory Configuration, Office Poster

o Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

o Machine learning pipelines with Apache Spark and Intel BigDL

o Physics data analysis and data reduction at scale with Apache Spark

o Physics data processing and machine learning in the cloud

Older blog entries:

(2016) IPython/Jupyter SQL Magic Functions for PySpark, Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs, How to Build a Neural Network Scoring Engine in PL/SQL, IPython/Jupyter Notebooks for Oracle, Linux BPF/bcc for Oracle Tracing, IPython Notebooks for Querying Apache Impala, SystemTap Guru Mode and Oracle SQL Parsing, PerfSheet.js: Oracle AWR Data Visualization in the Browser with JavaScript Pivot Charts, Linux Perf Probes for Oracle Tracing

(2015) Extended Stack Profiling - Ideas, Tools and Comments, Slides of the CERN Talks at UKOUG Tech15, Oracle Wait Events Investigated with Extended Stack Profiling and Flame Graphs, Linux Kernel Stack Profiling and Flame Graphs Applied to Oracle Investigations, Add Color to Your SQL, Diagnose High-Latency I/O Operations Using SystemTap, Heat Map Visualization of Latency Histograms for NetApp C-Mode, Event Histogram Metric and Oracle 12c, Heat Map Visualization of I/O Latency with SystemTap and PyLatencyMap, Latest Updates to PerfSheet4, a Tool for Oracle AWR Data Mining and Visualization

(2014) Talks at UKOUG TECH 2014 with CERN Speakers, Life of an Oracle I/O: Tracing Logical and Physical I/O with SystemTap, SystemTap into Oracle for Fun and Profit, Scaling up Cardinality Estimates in 12.1.0.2, ASM Metadata, Internals and Diagnostic Utilities, Oracle Optimizer Investigated with Flame Graphs, Flame Graphs for Oracle, A Closer Look at CALIBRATE_IO, Recent Updates of OraLatencyMap and PyLatencyMap, Wait Event History Sampling, an Experiment in Oracle Performance Analysis, Clusterware 12c and Restricted Service Registration for RAC

(2013) How to Recover Files from a Dropped ASM Disk Group, UKOUG Tech13, Latency Investigations and Lost Writes, Daylight Saving Time Change and AWR Data Mining, Getting Started with PyLatencyMap: Latency Heat Maps for Oracle, DTrace and More Sources, PyLatencyMap, a Performance Tool for Latency Data Visualization, DTrace Explorations of Oracle Wait Events on Linux and Solaris, OraLatencyMap v1.1 and Testing I/O with SLOB 2, Oracle Events' Latency Visualization and Heat Maps in SQL*plus, Testing Lost Writes with Oracle and Data Guard, AWR Analytics and Oracle Performance Visualization with PerfSheet4

(2012) Active Data Guard and UKOUG 2012, Command-Line DBA Scripts, How to Turn Off Adaptive Cursor Sharing, Cardinality Feedback and Serial Direct Read, Recursive Subquery Factoring, Oracle SQL and Physics, Listener.ora and Oraagent in RAC 11gR2, Purging Cursors From the Library Cache Using Full_hash_value, Kerberos Authentication and Proxy Users, Hash Collisions in Oracle: SQL Signature and SQL_ID, SQL Signature, Text Normalization and MD5 Hash, SQL Patch and Force Match, V$EVENT_HISTOGRAM_METRIC, Performance Metrics Views, Of I/O Latency, Skew and Histograms 2/2, Of I/O Latency, Skew and Histograms 1/2

Publications:

The ATLAS EventIndex - A BigData Catalogue for All ATLAS Experiment Events, D. Barberis et al., Comput Softw Big Sci 7, 2 (2023)
Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Matteo Migliorini, Riccardo Castellotti, Luca Canali, Marco Zanetti, Comput Softw Big Sci 4, 8 (2020)
ScienceBox Converging to Kubernetes containers in production for on-premises and hybrid clouds for CERNBox, SWAN, and EOS, Enrico Bocchi, Luca Canali, Diogo Castro, Prasanth Kothuri, Hugo Gonzalez Labrador, Maciej Malawski, Jakub T. Mościcki and Piotr Mrowczynski, EPJ Web of Conferences 245, 07047 (2020)
Using Big Data Technologies for HEP Analysis, M. Cremonesi et al., EPJ Web of Conferences 214, 06030 (2019)
Evolution of the Hadoop Platform and Ecosystem for High Energy Physics, Z. Baranowski et al., EPJ Web of Conferences 214, 04058 (2019)
A prototype for the evolution of ATLAS EventIndex based on Apache Kudu storage, Z. Baranowski et al., EPJ Web of Conferences 214, 04057 (2019)
Big Data Tools and Cloud Services for High Energy Physics Analysis in TOTEM Experiment, V. Avati et al., 2018, Proceedings of: 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion)
CMS Analysis and Data Reduction with Apache Spark, O. Gutsche et al., 2018 J. Phys.: Conf. Ser. 1085 042030
A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex, Zbigniew Baranowski, Luca Canali, Rainer Toebbicke, Julius Hrivnac and Dario Barberis, 2017 J. Phys.: Conf. Ser. 898 062020
Integration of Oracle and Hadoop: Hybrid Databases Affordable at Scale, Luca Canali, Zbigniew Baranowski and Prasanth Kothuri, 2017 J. Phys.: Conf. Ser. 898 042055
An Oracle-based event index for ATLAS, Elizabeth J Gallas, Gancho Dimitrov, Petya Vasileva, Zbigniew Baranowski, Luca Canali, Andrei Dumitru, Andrea Formica, 2017 J. Phys.: Conf. Ser. 898 042033
Scale Out Databases for CERN Use Cases, Zbigniew Baranowski, Maciej Grzybek, Luca Canali, Daniel Lanza Garcia, Kacper Surdy, 2015 J. Phys.: Conf. Ser. 664(4) 042002
Evolution of Database Replication Technologies for WLCG, Zbigniew Baranowski, Lorena Lobato Pardavila, Marcin Blaszczyk, Gancho Dimitrov, Luca Canali, 2015 J. Phys.: Conf. Ser. 664(4) 042032
Sequential data access with Oracle and Hadoop: a performance comparison, Zbigniew Baranowski, Luca Canali and Eric Grancher, 2014 J. Phys.: Conf. Ser. 513 042001
ATLAS database application enhancements using Oracle 11g, G Dimitrov, L Canali, M Blaszczyk and R Sorokoletov, 2012 J. Phys.: Conf. Ser. 396 052027
ATLAS Data Management Accounting with Hadoop Pig and HBase, Mario Lassnig, Vincent Garonne, Gancho Dimitrov and Luca Canali, 2012 J. Phys.: Conf. Ser. 396 052044
Structured storage in ATLAS Distributed Data Management: use cases and experiences, Mario Lassnig, Vincent Garonne, Angelos Molfetas, Thomas Beermann, Gancho Dimitrov, Luca Canali, Donal Zang and Lisa Azzurra Chinzer, 2012 J. Phys.: Conf. Ser. 396 052045
Advanced Technologies for Scalable ATLAS Conditions Database Access, R Basset, L Canali, G Dimitrov, M Girone, R Hawkings, P Nevski, A Valassi, A Vaniachine, F Viegas, R Walker and A Wong, 2010 J. Phys.: Conf. Ser. 219 042025

Miscellaneous:

ASM_Internals (pdf), ASM metadata and related X$ tables
ASM_Utilities (pdf), ASM support utilities, metadata management (kfed, amdu)
http://cern.ch/canali/resources.htm
Contact details
PGP public key: gpg2 --keyserver hkp://pool.sks-keyservers.net --recv-keys EF1D88DB

Last updated, April 2024
