Data analysis is concerned with extracting useful information and drawing meaningful inferences from measurements using a healthy blend of qualitative (e.g. visual) and quantitative (mathematical and statistical) analysis tools. A generic data analysis exercise involves several steps such as inspection, data pre-processing, transformation and modelling or classification; every stage involves inputs from the user, domain knowledge and hypothesis testing. In addition, it is important to take into account the end-use of the analysis – prediction, classification, anomaly detection, etc. – since it exerts a significant influence on the course of analysis, including the data acquisition mechanism.
During our discussion, Prof. Arun K. Tangirala of the Department of Chemical Engineering smiles as he explains the details of his work on seismic data analysis. “It’s data analysis, and not data dialysis!” he exclaims. The distinction, he notes, is that data analysis involves extracting the desired information from data, whereas dialysis involves removing waste; when analysis is conducted incorrectly, one may end up with waste or meaningless information. “Moreover, it is not just about removing unwanted information but making sense out of what the analyst is left with, which is nothing short of a challenge.”
Prof. Tangirala is part of the Process Systems Engineering & Data Sciences group, which consists of nearly fifty researchers (faculty, research fellows and scholars). The group is renowned worldwide for its work in process modelling, data analysis, control, monitoring and optimization. The members engage in active, cutting-edge research that is not only theoretically rich but also deeply engaged with several high-value, interdisciplinary and socially relevant projects through industrial consultancy, sponsorship, and collaborations with leading researchers around the world. On the skill development and training front, several foundational, advanced short-term, and full-semester courses on time-series analysis, system identification, multivariate data analysis, optimization, graph theory, statistical data analysis, and advanced control are offered year-round – a feat that is unique across the globe for a single department.
Understanding Seismic Activity and Analysis
Among the different domains of data, seismic (under-the-ground) data assumes enormous significance for obvious reasons. The ground is in continuous unrest, mainly due to waves in the ocean, changes in the earth’s crust, atmospheric variations and human activities.
Regardless of the source, all seismic events release energy proportional to their scale. This released energy moves outward in all directions as a wave and is recorded by a seismometer; seismograms are the recordings of ground vibratory motion so obtained. Analysis of these seismograms is vital to our understanding of the earth’s activity, especially for setting up early-warning systems for earthquakes, determining their source locations, and detecting other seismic events such as explosions. Detecting or predicting the damaging part of an earthquake holds great value in protecting lives and preventing loss of property.
Seismic signals are broadly classified into body waves (which travel through the interior of the Earth) and surface waves (which travel along its surface). Body waves comprise P waves (compressional waves, the first to reach the seismometer) and S waves (transverse in nature). Body waves cause high-frequency vibrations and are less destructive, whereas surface waves, comprising Love and Rayleigh waves, cause low-frequency vibrations and are the destructive ones.
Very little time elapses between the arrivals of the different waves. Fortunately, the P wave, which arrives first, is the least harmful, and therefore detecting its onset is critical to the success of any earthquake warning system. Data is recorded either from a single three-component station, from an array, or from a network of stations. Data acquired from a seismic array or network, as opposed to a single station, enables better signal detection and source location.
A single three-component station measures motion along three spatial directions (N-S, E-W and vertical). While it may have large error margins, it is the most cost-effective system, and it also provides information about the depth and strength of the event.
An array typically comprises 9 to 25 single stations with 10 to 20 km spacing. It improves the signal-to-noise ratio (SNR), making it easier to distinguish the signal from background noise. It also facilitates the determination of the source azimuth (the direction from which the signal arrived at the station), the local slowness and the epicentral distance of the source.
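The SNR gain from an array can be illustrated with a simple stacking experiment. The sketch below (in Python, purely for illustration; the project itself uses MATLAB and R) averages traces from a hypothetical 16-station array recording the same coherent signal with independent noise; the stacked noise level drops by roughly the square root of the station count. The station count and noise level are made-up values.

```python
# Illustrative sketch: stacking array traces suppresses incoherent noise.
# For N stations with a common coherent signal but independent noise,
# averaging reduces the noise standard deviation by roughly sqrt(N).
import random
import statistics

random.seed(42)

N_STATIONS = 16          # hypothetical array size
N_SAMPLES = 2000
NOISE_STD = 1.0

signal = [0.5] * N_SAMPLES  # constant stand-in for the coherent signal

# Independent noise realisations at each station
traces = [[s + random.gauss(0.0, NOISE_STD) for s in signal]
          for _ in range(N_STATIONS)]

# Simple delay-free stack: average across stations, sample by sample
stack = [sum(col) / N_STATIONS for col in zip(*traces)]

single_noise = statistics.pstdev(t - s for t, s in zip(traces[0], signal))
stack_noise = statistics.pstdev(t - s for t, s in zip(stack, signal))

print(f"single-station noise std: {single_noise:.3f}")
print(f"{N_STATIONS}-station stack noise std: {stack_noise:.3f}")
```

In a real array, traces must first be time-shifted to align the arrivals (beamforming) before stacking; the delay-free stack above assumes perfectly aligned traces.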
A network of stations, on the other hand, has local, regional or global distribution of more than 50 stations with a common data center. This helps in more accurate event detection.
Through the analysis of seismic signals from a few or many seismic stations, one can determine the origin time and location of the event and estimate its magnitude (M) – a measure of energy released by seismic sources.
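Why magnitude serves as a measure of released energy can be seen from the classical Gutenberg-Richter energy-magnitude relation, log10(E) = 1.5M + 4.8 (E in joules). The short sketch below (illustrative Python, not the project's code) shows that each unit of magnitude corresponds to roughly a 32-fold jump in energy.

```python
# Illustrative sketch of the Gutenberg-Richter energy-magnitude relation:
# log10(E) = 1.5*M + 4.8, with E in joules.
def released_energy_joules(magnitude: float) -> float:
    """Approximate radiated seismic energy for a given magnitude."""
    return 10 ** (1.5 * magnitude + 4.8)

for m in (4.0, 5.0, 6.0, 7.0):
    print(f"M {m:.1f}: ~{released_energy_joules(m):.2e} J")

# One magnitude unit up means 10**1.5 (about 32) times the energy
ratio = released_energy_joules(6.0) / released_energy_joules(5.0)
print(f"energy ratio per magnitude unit: ~{ratio:.1f}")
```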
The primary goal of seismic data analysis is to increase the reliability of earthquake probability estimates. The analysis not only helps in predicting the onset of seismic waves and locating the source of an event, but also in identifying underground nuclear explosions and in imaging the Earth’s deep interior structures (tomography). It is also used to prepare seismic risk maps for highly prone regions.
The specific objective of this Board of Research in Nuclear Sciences (BRNS) sponsored project, titled “Analysis of Seismic Data for Unsupervised Detection and Classification”, is to develop tools that detect the onset of the P wave with minimal human intervention, so that areas likely to be affected can be warned in time for efficient evacuation and a significant reduction in the damage caused by the earthquake. In addition, a classification-based method for distinguishing between earthquakes and explosions will be developed. Finally, the team aims to equip these methods to handle data missing due to communication losses, using ideas from compressive sensing.
A prime challenge in seismic data analysis is that even after the onset of the P wave is detected accurately, the lead time (the P-S time interval) for issuing warnings is very short, about two minutes.
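The shortness of this window follows directly from the wave speeds. Assuming typical textbook crustal averages (not values from the project) of about 6.0 km/s for P waves and 3.5 km/s for S waves, the S-minus-P arrival gap at an epicentral distance d is d(1/Vs - 1/Vp), as the illustrative Python sketch below shows.

```python
# Rough sketch of the P-S lead time, under assumed average crustal
# velocities (typical textbook figures, not the project's values).
VP_KM_S = 6.0   # assumed average P-wave speed
VS_KM_S = 3.5   # assumed average S-wave speed

def s_minus_p_seconds(distance_km: float) -> float:
    """Lead time between the P-wave arrival and the S-wave arrival."""
    return distance_km * (1.0 / VS_KM_S - 1.0 / VP_KM_S)

for d in (100, 300, 500):
    print(f"{d:4d} km: ~{s_minus_p_seconds(d):5.1f} s of lead time")
```

Even several hundred kilometres from the epicentre, the window is on the order of a minute or two, consistent with the figure quoted above.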
Further, a low signal-to-noise ratio (SNR), i.e. a large amount of uncertainty, has to be handled, since a typical P-wave amplitude is usually comparable to that of the noise.
No universal time-series model or statistical inference procedure can be applied, as seismic signal characteristics vary with geographical location, country, local climate and other factors.
Despite its unique challenges, analysis of seismic data shares certain commonalities with other domains. For instance, detecting the onset of the P wave has close analogies with fault detection in the process industry, albeit with some subtle differences. In a different and interesting study, strong similarities have been observed between the conditions leading to an earthquake and those in a faulty heart, as seen through ECG recordings. This is often referred to as the ‘human heart-type characterisation in the development of an earthquake’.
Overview of the Project
The main objective of this work is to develop unsupervised (automated) methods of seismic data analysis, so as to minimize user intervention in determining event onsets, source location and classification of earthquakes, thereby reducing reporting time. An additional objective is to develop methods for distinguishing earthquakes from explosions.
Stage 1: Noise Modelling
This stage primarily deals with developing statistical models for the background noise. Depending on the noise characteristics, a time-series model will be developed. With a statistical/time-series model of the noise in hand, a suitable method for determining the onset of earthquakes (P waves) will be devised.
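To make the detection step concrete, the sketch below implements the classical STA/LTA (short-term average over long-term average) trigger, a standard baseline for picking P-wave onsets. This is not the project's model-based detector, only a reference point; the sampling rate, window lengths and threshold are illustrative, and the trace is synthetic.

```python
# A minimal STA/LTA onset picker on a synthetic trace: background noise
# followed by a stronger arrival. Window lengths and threshold are
# illustrative, not tuned values from the project.
import random

random.seed(7)

onset_true = 3000              # sample index of the synthetic P arrival

# Synthetic trace: Gaussian noise, then a higher-amplitude arrival
trace = [random.gauss(0.0, 1.0) for _ in range(6000)]
for i in range(onset_true, 6000):
    trace[i] += random.gauss(0.0, 4.0)

def sta_lta_onset(x, n_sta=50, n_lta=500, threshold=3.0):
    """Return the first index where STA/LTA of the squared trace
    exceeds the threshold, or None if no trigger occurs."""
    energy = [v * v for v in x]
    for i in range(n_lta, len(x)):
        sta = sum(energy[i - n_sta:i]) / n_sta
        lta = sum(energy[i - n_lta:i]) / n_lta
        if lta > 0 and sta / lta > threshold:
            return i
    return None

onset_est = sta_lta_onset(trace)
print(f"true onset: {onset_true}, detected: {onset_est}")
```

A noise model of the kind developed in this stage refines such a detector: knowing the noise statistics allows the threshold to be set for a specified false-alarm rate rather than by trial and error.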
Stage 2: Multi-scale analysis
Once an event is detected accurately, a multi-scale analysis of the signals is carried out to separate the phenomena occurring at each scale. This scale separation enhances the characterization of seismic events, which eventually impacts the ability of the classifier in the next stage.
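The idea of scale separation can be sketched with a one-level Haar wavelet transform, the simplest member of the wavelet family. The pure-Python example below is an illustrative stand-in for the MATLAB/R wavelet tools used in the project; real analyses would use a deeper decomposition with a better-suited wavelet.

```python
# One-level Haar wavelet transform: splits a signal into a smooth
# (low-frequency) approximation and a rapid (high-frequency) detail,
# and reconstructs the signal exactly from the two.
import math

def haar_step(x):
    """Split x (even length) into approximation and detail coefficients."""
    s = math.sqrt(2.0)
    approx = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Perfect reconstruction from one-level Haar coefficients."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / s)
        x.append((a - d) / s)
    return x

signal = [1.0, 3.0, 2.0, 2.0, 5.0, 1.0, 0.0, 4.0]
approx, detail = haar_step(signal)
recon = haar_inverse(approx, detail)
print("approx:", approx)   # slow, large-scale content
print("detail:", detail)   # fast, small-scale content
```

Applying the split recursively to the approximation yields the multi-level decomposition, with each level isolating the events at one scale.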
Stage 3: Classification
Classification of seismic events here refers to the ability to discriminate between two different events at a single station, in the presence of large noise levels, so as to reduce the reporting time. This final stage involves the development of an unsupervised classifier for this purpose. Communication losses can severely affect the ability to analyse data continually; the team aims to address this by treating the missing-data mechanism as random and using ideas from the emerging fields of compressive sensing and sparse optimization.
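As a toy illustration of unsupervised discrimination, the sketch below runs a plain two-cluster k-means on made-up per-event features (a hypothetical dominant frequency and an amplitude-ratio measure). The features, their values, and the choice of k-means are all assumptions for illustration; the project's actual classifier may differ entirely.

```python
# Toy two-means clustering of hypothetical event features:
# (dominant frequency in Hz, P/S amplitude ratio). Both feature choices
# and values are invented for illustration.
import random

random.seed(1)

events = ([(random.gauss(2.0, 0.3), random.gauss(0.5, 0.1)) for _ in range(20)] +
          [(random.gauss(8.0, 0.5), random.gauss(2.0, 0.2)) for _ in range(20)])

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def centroid(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def kmeans2(points, iters=20):
    """Plain two-means clustering; returns a 0/1 label per point."""
    c0, c1 = points[0], points[-1]   # crude initialisation
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist2(p, c0) <= dist2(p, c1) else 1 for p in points]
        grp0 = [p for p, l in zip(points, labels) if l == 0]
        grp1 = [p for p, l in zip(points, labels) if l == 1]
        if grp0:
            c0 = centroid(grp0)
        if grp1:
            c1 = centroid(grp1)
    return labels

labels = kmeans2(events)
print("labels:", labels)
```

No labelled training data is used: the algorithm discovers the two groups from the feature geometry alone, which is the sense in which the project's classifier is "unsupervised".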
The following figure provides an overview of the approach envisioned for this project. All methods should meet the requirements of accuracy and reliability while keeping computation and implementation costs low.
Current Status and Future Goals
Understanding the noise model is critical for sound interpretation of the event data. Thus, seismic noise from different stations worldwide is being analysed. The present stage of the analysis is concerned with rigorous statistical tests for noise characterization (stationarity, linearity, etc.), the aim being a systematic procedure that can be applied universally to seismic noise across the globe. Statistical models for seismic noise are subsequently being developed, and their variations across the globe are being studied.
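The project applies rigorous formal tests; as a crude illustrative stand-in for what a stationarity check asks, the Python sketch below compares the mean and standard deviation of two halves of a record: large discrepancies suggest non-stationarity. The tolerance and the synthetic records are arbitrary, for illustration only.

```python
# Crude split-half stationarity check (illustrative only; rigorous tests
# such as those used in the project are far more careful).
import random
import statistics

random.seed(3)

def halves_agree(x, tol=0.25):
    """True if the mean and std of the two halves of x agree to within
    a fraction tol of the overall standard deviation."""
    h = len(x) // 2
    a, b = x[:h], x[h:]
    scale = statistics.pstdev(x) or 1.0
    mean_ok = abs(statistics.fmean(a) - statistics.fmean(b)) < tol * scale
    std_ok = abs(statistics.pstdev(a) - statistics.pstdev(b)) < tol * scale
    return mean_ok and std_ok

stationary = [random.gauss(0.0, 1.0) for _ in range(4000)]        # flat noise
drifting = [random.gauss(i / 1000.0, 1.0) for i in range(4000)]   # drifting mean

print("stationary record passes:", halves_agree(stationary))
print("drifting record passes:  ", halves_agree(drifting))
```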
Future goals involve validation of noise models, development of methods for detecting onset of P waves and realizing the rest of the project objectives. All models developed shall be validated by the BARC Seismic Division team.
Some background information
The principal collaborator, at the Bhabha Atomic Research Centre (BARC), Mumbai, is Dr. Siddhartha Mukhopadhyay, head of the Seismic Division and a leading scientist at BARC. On the IIT Madras front, the project is led by Prof. Arun K. Tangirala. The team also includes Ms. Kanchan Aggarwal, a project associate and research scholar at IIT Madras, who is pursuing her doctoral degree on this project.
Datasets for this work are presently obtained from the Incorporated Research Institutions for Seismology (IRIS), which manages and provides access to observed and derived data – including ground-motion, atmospheric, infrasonic and hydrological data – for the global earth-science community. In addition to IRIS, several other sources of earthquake data are available: SEG, ORFEUS, the U.S. Geological Survey (USGS), the UTAM seismic data library, etc. The analysis is carried out using a blend of univariate and multivariate stochastic (time-series modelling) and deterministic (e.g. Fourier and wavelet analysis) signal-analysis tools in MATLAB and R.
Meet the Prof
Dr. Arun K. Tangirala has been with the Department of Chemical Engineering, IIT Madras, since 2004, where he is a Professor. His research interests span the fields of process control and monitoring, system identification, applied signal processing and fuel cell systems. He teaches several full-term and short-term courses on process control, system identification, theory and applications of wavelet transforms, random processes and fuel cell systems.
Meet the Author
Sriraghav Srinivasan is an unapologetic foodie who never misses an opportunity to travel. Raghav would probably be found hunched over the latest Archer book or binge-watching the TV show F.R.I.E.N.D.S. He’s an undergraduate Junior at the Biotech Department who’s interested in a variety of things from Neuroscience to Financial Markets.