Data Preprocessing and Visualization with PMF_toolkits¶
In this notebook, we demonstrate the complete workflow for PMF analysis using PMF_toolkits:
Data Preprocessing:
- Loading and examining raw data
- Handling detection limits and missing values
- Calculating uncertainties
- Computing signal-to-noise ratios and correlations
- Selecting appropriate species
Data Analysis:
- Time series analysis
- Regression and correlation analysis
- Seasonal pattern analysis
- Cluster analysis
PMF Visualization:
- Factor profile visualization
- Time series contribution plots
- Seasonal contribution analysis
- Pollution level analysis
- Comprehensive factor visualization
1. Setup and Data Loading¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import datetime, timedelta
from PMF_toolkits.preprocessing import PMFPreprocessor
from PMF_toolkits.core import PMF
from PMF_toolkits.utils import add_season, get_sourceColor, format_xaxis_timeseries, pretty_specie
# Set up plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
1.1 Generate Sample Data¶
First, let's create a synthetic dataset that mimics real environmental data, including:
- Time series of measurements for multiple species
- Missing values
- Below detection limit values
- Seasonal patterns
# Create a date range for 2 years of daily data
dates = pd.date_range(start='2020-01-01', end='2021-12-31', freq='D')
# Initialize an empty DataFrame with the dates as index
data = pd.DataFrame(index=dates)
# Set random seed for reproducibility
np.random.seed(42)
# Generate data for PM10 with seasonal pattern (higher in winter)
winter_effect = 20 * (np.cos(np.pi * np.arange(len(dates))/182.5) + 1)
data['PM10'] = 25 + winter_effect + np.random.normal(0, 10, len(dates))
data['PM10'] = data['PM10'].clip(lower=0) # No negative concentrations
# Generate correlated species
# SO4 - high in summer due to photochemistry
summer_effect = 5 * (-np.cos(np.pi * np.arange(len(dates))/182.5) + 1)
data['SO4'] = 3 + summer_effect + np.random.normal(0, 1, len(dates))
data['SO4'] = data['SO4'].clip(lower=0)
# NO3 - high in winter
data['NO3'] = 2 + 0.15 * winter_effect + np.random.normal(0, 1, len(dates))
data['NO3'] = data['NO3'].clip(lower=0)
# NH4 - correlated with SO4 and NO3
data['NH4'] = 0.3 * data['SO4'] + 0.2 * data['NO3'] + np.random.normal(0, 0.5, len(dates))
data['NH4'] = data['NH4'].clip(lower=0)
# OC - organic carbon, higher in winter (wood burning) and summer (biogenic)
data['OC'] = 5 + 0.1 * winter_effect + 0.05 * summer_effect + np.random.normal(0, 2, len(dates))
data['OC'] = data['OC'].clip(lower=0)
# EC - elemental carbon, correlated with OC
data['EC'] = 0.3 * data['OC'] + np.random.normal(0, 0.7, len(dates))
data['EC'] = data['EC'].clip(lower=0)
# Na - sea salt, higher in windy conditions
wind_effect = 3 * np.sin(np.pi * np.arange(len(dates))/91.25) + 3
data['Na'] = 1 + 0.1 * wind_effect + np.random.normal(0, 0.5, len(dates))
data['Na'] = data['Na'].clip(lower=0)
# Cl - sea salt, correlated with Na
data['Cl'] = 1.5 * data['Na'] + np.random.normal(0, 0.3, len(dates))
data['Cl'] = data['Cl'].clip(lower=0)
# K - biomass burning, higher in winter
data['K'] = 0.5 + 0.05 * winter_effect + np.random.normal(0, 0.2, len(dates))
data['K'] = data['K'].clip(lower=0)
# Ca - crustal, no strong seasonality
data['Ca'] = 0.8 + np.random.normal(0, 0.3, len(dates))
data['Ca'] = data['Ca'].clip(lower=0)
# Mg - crustal, correlated with Ca
data['Mg'] = 0.4 * data['Ca'] + np.random.normal(0, 0.1, len(dates))
data['Mg'] = data['Mg'].clip(lower=0)
# Fe - crustal, correlated with Ca
data['Fe'] = 0.5 * data['Ca'] + np.random.normal(0, 0.2, len(dates))
data['Fe'] = data['Fe'].clip(lower=0)
# Add some missing values (randomly distribute ~5% missing values)
for col in data.columns:
    mask = np.random.random(len(data)) < 0.05
    data.loc[mask, col] = np.nan
# Add some below detection limit values
# Define detection limits for each species
detection_limits = {
    'PM10': 1.0,
    'SO4': 0.1,
    'NO3': 0.1,
    'NH4': 0.05,
    'OC': 0.5,
    'EC': 0.2,
    'Na': 0.1,
    'Cl': 0.1,
    'K': 0.05,
    'Ca': 0.05,
    'Mg': 0.02,
    'Fe': 0.05
}
# Replace values below detection limit with a marker
for col, dl in detection_limits.items():
    mask = data[col] < dl
    data.loc[mask, col] = '<QL'
# Display the first few rows of the dataset
data.head()
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 | 69.967142 | 2.021627 | 7.790977 | 1.483097 | 9.183505 | 3.00995 | 0.95686 | 1.258455 | 2.31571 | 0.262077 | 0.059805 | 0.094389 |
| 2020-01-02 | 63.614394 | 3.408994 | 7.149035 | 2.547358 | 9.504222 | 2.856291 | 0.593991 | 0.416258 | 2.386773 | 0.713628 | 0.209851 | 0.068432 |
| 2020-01-03 | 71.465033 | 1.300379 | 7.417699 | 1.542662 | 8.766505 | NaN | 1.393559 | 2.33539 | 2.548034 | 0.873512 | NaN | 0.625942 |
| 2020-01-04 | 80.203635 | 4.035822 | 8.584579 | 3.140606 | 9.427093 | 2.940471 | 1.62358 | 2.021191 | 2.404626 | 0.719014 | 0.092953 | 0.550142 |
| 2020-01-05 | NaN | 3.484446 | 9.662795 | 2.987467 | 12.152089 | 5.194398 | 1.598625 | 2.267092 | 2.620097 | 0.336965 | 0.022993 | 0.326297 |
1.2 Examine the Raw Data¶
Let's examine the raw data to see the distribution of values, including missing and below detection limit values.
detection_limits
{'PM10': 1.0,
'SO4': 0.1,
'NO3': 0.1,
'NH4': 0.05,
'OC': 0.5,
'EC': 0.2,
'Na': 0.1,
'Cl': 0.1,
'K': 0.05,
'Ca': 0.05,
'Mg': 0.02,
'Fe': 0.05}
# Create a PMFPreprocessor instance with our data
preprocessor = PMFPreprocessor(data, ql_values=detection_limits)
# Track values below detection limit
dl_mask = preprocessor.track_quantification_limits()
# Get data quality summary
quality_summary = preprocessor.summarize_data_quality()
display(quality_summary)
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Missing | 5.471956 | 4.103967 | 5.198358 | 5.06156 | 4.924761 | 5.745554 | 5.608755 | 6.019152 | 3.967168 | 4.240766 | 4.103967 | 4.787962 |
| Below QL | 0.136799 | 0.0 | 0.410397 | 0.0 | 0.410397 | 2.735978 | 0.820793 | 0.683995 | 0.0 | 0.547196 | 3.283174 | 8.207934 |
# Visualize missing and below detection limit values
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.heatmap(data.isna(), cmap='viridis', cbar_kws={'label': 'Missing Values'}, yticklabels=False)
plt.title('Missing Values Distribution')
plt.xlabel('Variables')
plt.ylabel('Samples')
plt.subplot(1, 2, 2)
sns.heatmap(dl_mask, cmap='viridis', cbar_kws={'label': 'Below QL Values'}, yticklabels=False)
plt.title('Below Detection Limit Values Distribution')
plt.xlabel('Variables')
plt.tight_layout()
plt.show()
2. Data Preprocessing¶
Now we'll go through the complete preprocessing workflow for PMF analysis:
2.1 Convert to Numeric Values¶
First, we need to convert all values to numeric, replacing below detection limit markers with appropriate numeric values (typically DL/2).
# Convert data to numeric with proper handling of detection limits
data_numeric = preprocessor.convert_to_numeric()
# Display the first few rows to confirm conversion
data_numeric.head()
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 | 69.967142 | 2.021627 | 7.790977 | 1.483097 | 9.183505 | 3.009950 | 0.956860 | 1.258455 | 2.315710 | 0.262077 | 0.059805 | 0.094389 |
| 2020-01-02 | 63.614394 | 3.408994 | 7.149035 | 2.547358 | 9.504222 | 2.856291 | 0.593991 | 0.416258 | 2.386773 | 0.713628 | 0.209851 | 0.068432 |
| 2020-01-03 | 71.465033 | 1.300379 | 7.417699 | 1.542662 | 8.766505 | NaN | 1.393559 | 2.335390 | 2.548034 | 0.873512 | NaN | 0.625942 |
| 2020-01-04 | 80.203635 | 4.035822 | 8.584579 | 3.140606 | 9.427093 | 2.940471 | 1.623580 | 2.021191 | 2.404626 | 0.719014 | 0.092953 | 0.550142 |
| 2020-01-05 | NaN | 3.484446 | 9.662795 | 2.987467 | 12.152089 | 5.194398 | 1.598625 | 2.267092 | 2.620097 | 0.336965 | 0.022993 | 0.326297 |
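The conventional substitution for below-quantification-limit values is DL/2. A minimal pandas sketch of that rule, independent of PMF_toolkits (the helper name is hypothetical):

```python
import numpy as np
import pandas as pd

def replace_below_ql(series: pd.Series, dl: float) -> pd.Series:
    """Replace '<QL' markers with DL/2, the conventional substitution
    for below-detection-limit values, and coerce the result to float."""
    out = series.replace('<QL', dl / 2)
    return pd.to_numeric(out, errors='coerce')

raw = pd.Series([0.3, '<QL', 1.2, np.nan])
print(replace_below_ql(raw, dl=0.1).tolist())  # [0.3, 0.05, 1.2, nan]
```

`convert_to_numeric` handles this (plus per-species limits) internally; the sketch only shows the DL/2 idea for one column.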
2.2 Handle Missing Values (optional)¶
Next, we need to handle missing values using various methods:
# Compare different missing value handling methods
methods = {
    'interpolate': preprocessor.handle_missing_values(method='interpolate', data=data_numeric),
    'mean': preprocessor.handle_missing_values(method='mean', data=data_numeric),
    'median': preprocessor.handle_missing_values(method='median', data=data_numeric),
    'min_fraction': preprocessor.handle_missing_values(method='min_fraction', data=data_numeric, fraction=0.5)
}

# Check if missing values were filled
for method_name, data_filled in methods.items():
    print(f"{method_name}: {data_filled.isna().sum().sum()} missing values remaining")
interpolate: 0 missing values remaining
mean: 0 missing values remaining
median: 0 missing values remaining
min_fraction: 0 missing values remaining
# Let's use interpolation for our analysis
data_filled = methods['interpolate']
# Plot an example to see how different methods filled missing values
fig, ax = plt.subplots(figsize=(12, 6))
# Select a time slice with missing values
missing_mask = data_numeric['PM10'].isna()
if missing_mask.sum() > 0:
    start_idx = missing_mask.idxmax() - pd.Timedelta(days=10)
    end_idx = missing_mask.idxmax() + pd.Timedelta(days=10)

    # Plot raw data with missing values
    ax.plot(data_numeric.loc[start_idx:end_idx, 'PM10'], 'o-', label='Raw Data', alpha=0.7)

    # Plot different filling methods
    for method_name, data_method in methods.items():
        ax.plot(data_method.loc[start_idx:end_idx, 'PM10'], 'o-', label=f'Filled ({method_name})')

    ax.set_title('Comparison of Missing Value Treatment Methods')
    ax.set_ylabel('PM10 Concentration (µg m$^{-3}$)')
    ax.legend()
    plt.tight_layout()
    plt.show()
else:
    print("No missing values in PM10 for demonstration, please adjust the example")
No missing values in PM10 for demonstration, please adjust the example
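The four handling methods map onto simple pandas operations. A self-contained sketch of roughly equivalent logic (the library's exact behaviour may differ; `min_fraction` is interpreted here as filling with a fraction of the column minimum):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled = {
    'interpolate': s.interpolate(),           # linear between neighbours
    'mean': s.fillna(s.mean()),               # column mean
    'median': s.fillna(s.median()),           # column median
    'min_fraction': s.fillna(s.min() * 0.5),  # fraction of the column minimum
}
for name, f in filled.items():
    print(name, f.tolist())
```

Interpolation preserves local trends, which is why it is chosen below; the mean/median fills flatten temporal structure around gaps.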
2.3 Calculate Uncertainties¶
Now we'll compute uncertainties using different methods available in the PMFPreprocessor:
# Calculate uncertainties using different methods
uncertainty_methods = {
    'percentage': preprocessor.compute_uncertainties(
        method='percentage',
        params={'percentage': 0.1}
    ),
    'DL': preprocessor.compute_uncertainties(
        method='DL',
        params={'DL': detection_limits}
    ),
    'polissar': preprocessor.compute_uncertainties(
        method='polissar',
        params={
            'DL': detection_limits,
            'error_fraction': 0.1
        }
    )
}
# Compare uncertainty calculation methods for a specific species
species_to_check = 'SO4'
fig, ax = plt.subplots(figsize=(12, 6))
# Plot the original data
ax.plot(data_filled[species_to_check], 'o-', alpha=0.5, label='Data')
# Plot uncertainties as shaded regions
for method_name, uncertainties in uncertainty_methods.items():
    ax.fill_between(
        data_filled.index,
        data_filled[species_to_check] - uncertainties[species_to_check],
        data_filled[species_to_check] + uncertainties[species_to_check],
        alpha=0.2,
        label=f'Uncertainty ({method_name})'
    )

ax.set_title(f'Comparison of Uncertainty Calculation Methods for {species_to_check}')
ax.set_ylabel(f'{species_to_check} Concentration (µg m$^{-3}$)')
ax.legend()
plt.tight_layout()
plt.show()
The cell raises `TypeError: ufunc 'isfinite' not supported for the input types` inside `ax.fill_between`: at least one of the uncertainty DataFrames still has object (non-numeric) dtype, carried over from the `<QL` markers in the raw data, so matplotlib cannot mask invalid values. Coercing the uncertainties with `pd.to_numeric` (or computing them from the cleaned numeric data) resolves it.
Let's use the Polissar method for our uncertainties, as it's commonly used in PMF studies:
# Select the uncertainties to use
uncertainties = uncertainty_methods['polissar']

# Display the uncertainty summary
uncertainty_summary = uncertainties.describe().T
#uncertainty_summary['relative_uncertainty'] = uncertainty_summary['mean'] / data_filled.mean()
#display(uncertainty_summary[['mean', 'std', 'min', 'max', 'relative_uncertainty']])
uncertainties.describe()
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 691.000000 | 701.000000 | 693.000000 | 694.000000 | 695.000000 | 689.000000 | 690.000000 | 687.000000 | 702.000000 | 700.00000 | 701.000000 | 696.00000 |
| unique | 691.000000 | 701.000000 | 691.000000 | 694.000000 | 693.000000 | 670.000000 | 685.000000 | 683.000000 | 702.000000 | 697.00000 | 678.000000 | 637.00000 |
| top | 6.695845 | 0.202669 | 0.017579 | 0.504351 | 0.087896 | 0.035158 | 0.017579 | 0.017579 | 0.287438 | 0.00879 | 0.003516 | 0.00879 |
| freq | 1.000000 | 1.000000 | 3.000000 | 1.000000 | 3.000000 | 20.000000 | 6.000000 | 5.000000 | 1.000000 | 4.00000 | 24.000000 | 60.00000 |
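For reference, the Polissar et al. (1998) rules are commonly stated as: below-DL values get an uncertainty of (5/6)·DL, detected values get error_fraction·x + DL/3, and missing values are assigned 4× the species mean. A sketch of that rule, not necessarily the exact formula PMF_toolkits implements:

```python
import numpy as np

def polissar_uncertainty(x, dl, error_fraction=0.1):
    """One common statement of the Polissar et al. (1998) rules:
    below DL -> uncertainty = 5/6 * DL
    above DL -> uncertainty = error_fraction * x + DL / 3
    (the missing-value rule is omitted here)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= dl, 5.0 / 6.0 * dl, error_fraction * x + dl / 3.0)

print(polissar_uncertainty([0.05, 2.0], dl=0.1))
```

The key property is that uncertainty never goes to zero near the detection limit, which keeps PMF from over-weighting barely detected values.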
2.4 Calculate Signal-to-Noise Ratio¶
Signal-to-noise ratio is important for deciding which species to include in the PMF analysis:
# Calculate signal-to-noise ratio using the calculate_signal_to_noise method
sn_ratio_df = preprocessor.calculate_signal_to_noise()
# Display the signal-to-noise ratios
print("Signal-to-Noise Ratios:")
print(sn_ratio_df.sort_values('S/N', ascending=False))
# Extract the series for plotting
sn_ratio = sn_ratio_df['S/N']
Signal-to-Noise Ratios:
S/N
PM10 8.507524
SO4 7.587993
NO3 7.489282
Ca 7.483388
OC 7.453301
Cl 7.360436
Na 7.32999
Mg 7.277671
K 7.131311
EC 7.120884
Fe 6.771209
NH4 4.663178
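For comparison, EPA PMF 5.0 computes S/N by summing (x − s)/s over samples whose concentration exceeds the uncertainty and dividing by the number of samples. A sketch of that definition (`calculate_signal_to_noise` may use a different formula):

```python
import numpy as np

def epa_signal_to_noise(conc, unc):
    """EPA PMF 5.0 style S/N: samples where concentration exceeds its
    uncertainty contribute (x - s) / s; all other samples contribute 0."""
    conc = np.asarray(conc, dtype=float)
    unc = np.asarray(unc, dtype=float)
    d = np.where(conc > unc, (conc - unc) / unc, 0.0)
    return d.mean()

print(epa_signal_to_noise([10.0, 5.0, 0.5], [1.0, 1.0, 1.0]))
```

Under this convention species with S/N below 2 are typically downweighted ("weak") and those below about 0.5–1 excluded ("bad"), which is what the threshold lines in the next plot indicate.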
# Use the S/N ratio calculated from calculate_signal_to_noise above
# Check if we have valid S/N ratios before plotting
if len(sn_ratio) > 0 and sn_ratio.sum() > 0:
    # Plot signal-to-noise ratio for all species
    plt.figure(figsize=(12, 6))
    sn_ratio.sort_values().plot(kind='barh')
    plt.axvline(x=2, color='r', linestyle='--', label='Weak (S/N = 2)')
    plt.axvline(x=1, color='r', linestyle='-', label='Bad (S/N = 1)')
    plt.title('Signal-to-Noise Ratio by Species')
    plt.xlabel('Signal-to-Noise Ratio')
    plt.legend()
    plt.tight_layout()
    plt.show()
else:
    print("No valid signal-to-noise ratios were calculated.")
    print("Using a simplified calculation for demonstration purposes:")

    # Create a simplified S/N calculation for demonstration
    data_numeric = preprocessor.convert_to_numeric()
    simple_sn = data_numeric.std() / data_numeric.std().min()

    plt.figure(figsize=(12, 6))
    simple_sn.sort_values().plot(kind='barh')
    plt.axvline(x=2, color='r', linestyle='--', label='Weak (S/N = 2)')
    plt.axvline(x=1, color='r', linestyle='-', label='Bad (S/N = 1)')
    plt.title('Simplified Signal Variability Ratio by Species (Demo)')
    plt.xlabel('Signal Variability Ratio')
    plt.legend()
    plt.tight_layout()
    plt.show()
2.5 Compute Correlation Matrix¶
Let's examine correlations between species:
# Compute correlation matrix
correlation_matrix = preprocessor.compute_correlation_matrix()
# Plot the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Matrix of Species')
plt.tight_layout()
plt.show()
preprocessor.plot_heatmap(cluster=True)
2.6 Select Good Tracer Species¶
Let's identify good tracer species based on correlation and signal-to-noise:
# Select good tracer species
tracers = preprocessor.select_tracers(correlation_threshold=0.3, sn_threshold=3.0)
print(f"Good tracer species: {tracers}")
Good tracer species: ['PM10', 'SO4', 'NO3', 'NH4', 'OC', 'EC', 'Na', 'Cl', 'K', 'Ca', 'Mg', 'Fe']
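The exact criterion used by `select_tracers` isn't shown here; a plausible sketch combining the two thresholds (hypothetical logic, for illustration only) would keep species with adequate S/N that also correlate with at least one other species:

```python
import pandas as pd

def select_tracers_sketch(sn, corr, sn_threshold=3.0, correlation_threshold=0.3):
    """Hypothetical selection rule: keep species whose S/N meets the
    threshold and which correlate with at least one other species
    above the correlation threshold (absolute value)."""
    keep = []
    for sp in sn.index:
        others = corr[sp].drop(sp).abs()
        if sn[sp] >= sn_threshold and (others >= correlation_threshold).any():
            keep.append(sp)
    return keep

sn = pd.Series({'A': 5.0, 'B': 1.0, 'C': 4.0})
corr = pd.DataFrame([[1.0, 0.1, 0.8], [0.1, 1.0, 0.2], [0.8, 0.2, 1.0]],
                    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
print(select_tracers_sketch(sn, corr))  # ['A', 'C']
```

With our synthetic data every species clears both thresholds, which is why all twelve are reported as tracers above.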
2.7 Prepare Data for PMF¶
Now we'll filter the species based on data quality and prepare the final dataset for PMF analysis:
# Filter species based on data quality
data_filtered = preprocessor.filter_species(min_valid=0.75)
print(f"Species retained after filtering: {data_filtered.columns.tolist()}")
# Prepare PMF input in one step
data_normalized, uncertainties_final, metadata = preprocessor.prepare_pmf_input(
    total_var='PM10',
    uncertainty_method='polissar',
    uncertainty_params={'DL': detection_limits, 'error_fraction': 0.1},
    min_valid=0.75,
    handling_method='interpolate'
)

print("\nMetadata from PMF preparation:")
for key, value in metadata.items():
    if key != 'uncertainty_params':
        print(f"  {key}: {value}")
# Display the normalized data
print("\nNormalized data (first 5 rows):")
display(data_normalized.head())
Species retained after filtering: ['PM10', 'SO4', 'NO3', 'NH4', 'OC', 'EC', 'Na', 'Cl', 'K', 'Ca', 'Mg', 'Fe']

Metadata from PMF preparation:
  total_var: PM10
  uncertainty_method: polissar
  min_valid: 0.75
  handling_method: interpolate

Normalized data (first 5 rows):
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 | 1.0 | 0.028894 | 0.111352 | 0.021197 | 0.131255 | 0.043019 | 0.013676 | 0.017986 | 0.033097 | 0.003746 | 0.000855 | 0.001349 |
| 2020-01-02 | 1.0 | 0.053588 | 0.112381 | 0.040044 | 0.149404 | 0.044900 | 0.009337 | 0.006543 | 0.037519 | 0.011218 | 0.003299 | 0.001076 |
| 2020-01-03 | 1.0 | 0.018196 | 0.103795 | 0.021586 | 0.122668 | NaN | 0.019500 | 0.032679 | 0.035654 | 0.012223 | NaN | 0.008759 |
| 2020-01-04 | 1.0 | 0.050320 | 0.107035 | 0.039158 | 0.117539 | 0.036663 | 0.020243 | 0.025201 | 0.029982 | 0.008965 | 0.001159 | 0.006859 |
| 2020-01-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
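The PM10 column equal to 1.0 in the output suggests that normalization divides each row by the total variable. A minimal sketch of that step:

```python
import pandas as pd

# Sketch: divide every species by the total variable, row-wise. This is
# consistent with the PM10 column equalling 1.0 in the table above.
df = pd.DataFrame({'PM10': [50.0, 40.0], 'SO4': [5.0, 2.0]})
normalized = df.div(df['PM10'], axis=0)
print(normalized['SO4'].tolist())  # [0.1, 0.05]
```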
3. Exploratory Data Analysis¶
Let's explore our preprocessed data with the visualization tools from PMF_toolkits:
3.1 Time Series Analysis¶
Let's examine time series patterns using the preprocessing plotting functions:
# Plot time series for selected species
fig = preprocessor.plot_timeseries(['PM10', 'SO4', 'NO3'], unit='µg m$^{-3}$')
plt.tight_layout()
plt.show()
# Plot time series with two y-axes to compare different scales
fig = preprocessor.plot_timeseries_2axis('PM10', 'SO4', unit='µg m$^{-3}$', show_equation=True)
plt.tight_layout()
plt.show()
3.2 Regression Analysis¶
Let's look at relationships between species using regression plots:
# Create regression plots between related species
fig = preprocessor.regression_plot('SO4', 'NH4')
plt.tight_layout()
plt.show()
# Try cluster analysis to identify different regimes
fig, results = preprocessor.cluster_analysis('OC', 'EC', n_clusters=[2, 3])
plt.tight_layout()
plt.show()
# Display cluster statistics
for k, cluster_data in results.items():
print(f"\nResults for k={k} clusters:")
for cluster, stats in cluster_data['cluster_stats'].items():
print(f"Cluster {cluster+1}: {stats['count']} points, R2 = {stats['r_squared']:.3f}, slope = {stats['slope']:.3f}")
Results for k=2 clusters:
Cluster 1: 344 points, R2 = 0.278, slope = 0.270
Cluster 2: 312 points, R2 = 0.182, slope = 0.249

Results for k=3 clusters:
Cluster 1: 268 points, R2 = 0.026, slope = 0.126
Cluster 2: 181 points, R2 = 0.089, slope = 0.222
Cluster 3: 207 points, R2 = 0.126, slope = 0.184
3.3 Seasonal Analysis¶
Now let's use the PMF_toolkits utility functions to analyze seasonal patterns:
# Add season information to the data
seasonal_data = add_season(data_filled)
# Calculate seasonal averages
seasonal_averages = seasonal_data.groupby('season').mean()
# Reorder seasons
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
seasonal_averages = seasonal_averages.reindex(season_order)
# Plot seasonal patterns for key species
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
species_to_plot = ['PM10', 'SO4', 'NO3', 'OC']
for i, species in enumerate(species_to_plot):
    seasonal_averages[species].plot(kind='bar', ax=axes[i])
    axes[i].set_title(f'Seasonal Pattern - {species}')
    axes[i].set_ylabel('Concentration (µg m$^{-3}$)')

plt.tight_layout()
plt.show()
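`add_season` is imported from PMF_toolkits.utils; its exact convention isn't shown here, but a minimal month-to-season mapping consistent with the Winter/Spring/Summer/Fall labels above could look like:

```python
import pandas as pd

def month_to_season(month: int) -> str:
    """Map a month number to a meteorological season (DJF = Winter)."""
    return {12: 'Winter', 1: 'Winter', 2: 'Winter',
            3: 'Spring', 4: 'Spring', 5: 'Spring',
            6: 'Summer', 7: 'Summer', 8: 'Summer',
            9: 'Fall', 10: 'Fall', 11: 'Fall'}[month]

idx = pd.date_range('2020-01-01', periods=4, freq='QS')  # Jan, Apr, Jul, Oct starts
print([month_to_season(m) for m in idx.month])  # ['Winter', 'Spring', 'Summer', 'Fall']
```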