Data Preprocessing and Visualization with PMF_toolkits¶
In this notebook, we demonstrate the complete workflow for PMF analysis using PMF_toolkits:
Data Preprocessing:
- Loading and examining raw data
- Handling detection limits and missing values
- Calculating uncertainties
- Computing signal-to-noise ratios and correlations
- Selecting appropriate species
Data Analysis:
- Time series analysis
- Regression and correlation analysis
- Seasonal pattern analysis
- Cluster analysis
PMF Visualization:
- Factor profile visualization
- Time series contribution plots
- Seasonal contribution analysis
- Pollution level analysis
- Comprehensive factor visualization
1. Setup and Data Loading¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import datetime, timedelta
from PMF_toolkits.preprocessing import PMFPreprocessor
from PMF_toolkits.core import PMF
from PMF_toolkits.utils import add_season, get_sourceColor, format_xaxis_timeseries, pretty_specie
# Set up plotting
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
1.1 Generate Sample Data¶
First, let's create a synthetic dataset that mimics real environmental data, including:
- Time series of measurements for multiple species
- Missing values
- Below detection limit values
- Seasonal patterns
# Create a date range for 2 years of daily data
dates = pd.date_range(start='2020-01-01', end='2021-12-31', freq='D')
# Initialize an empty DataFrame with the dates as index
data = pd.DataFrame(index=dates)
# Set random seed for reproducibility
np.random.seed(42)
# Generate data for PM10 with seasonal pattern (higher in winter)
winter_effect = 20 * (np.cos(np.pi * np.arange(len(dates))/182.5) + 1)
data['PM10'] = 25 + winter_effect + np.random.normal(0, 10, len(dates))
data['PM10'] = data['PM10'].clip(lower=0) # No negative concentrations
# Generate correlated species
# SO4 - high in summer due to photochemistry
summer_effect = 5 * (-np.cos(np.pi * np.arange(len(dates))/182.5) + 1)
data['SO4'] = 3 + summer_effect + np.random.normal(0, 1, len(dates))
data['SO4'] = data['SO4'].clip(lower=0)
# NO3 - high in winter
data['NO3'] = 2 + 0.15 * winter_effect + np.random.normal(0, 1, len(dates))
data['NO3'] = data['NO3'].clip(lower=0)
# NH4 - correlated with SO4 and NO3
data['NH4'] = 0.3 * data['SO4'] + 0.2 * data['NO3'] + np.random.normal(0, 0.5, len(dates))
data['NH4'] = data['NH4'].clip(lower=0)
# OC - organic carbon, higher in winter (wood burning) and summer (biogenic)
data['OC'] = 5 + 0.1 * winter_effect + 0.05 * summer_effect + np.random.normal(0, 2, len(dates))
data['OC'] = data['OC'].clip(lower=0)
# EC - elemental carbon, correlated with OC
data['EC'] = 0.3 * data['OC'] + np.random.normal(0, 0.7, len(dates))
data['EC'] = data['EC'].clip(lower=0)
# Na - sea salt, higher in windy conditions
wind_effect = 3 * np.sin(np.pi * np.arange(len(dates))/91.25) + 3
data['Na'] = 1 + 0.1 * wind_effect + np.random.normal(0, 0.5, len(dates))
data['Na'] = data['Na'].clip(lower=0)
# Cl - sea salt, correlated with Na
data['Cl'] = 1.5 * data['Na'] + np.random.normal(0, 0.3, len(dates))
data['Cl'] = data['Cl'].clip(lower=0)
# K - biomass burning, higher in winter
data['K'] = 0.5 + 0.05 * winter_effect + np.random.normal(0, 0.2, len(dates))
data['K'] = data['K'].clip(lower=0)
# Ca - crustal, no strong seasonality
data['Ca'] = 0.8 + np.random.normal(0, 0.3, len(dates))
data['Ca'] = data['Ca'].clip(lower=0)
# Mg - crustal, correlated with Ca
data['Mg'] = 0.4 * data['Ca'] + np.random.normal(0, 0.1, len(dates))
data['Mg'] = data['Mg'].clip(lower=0)
# Fe - crustal, correlated with Ca
data['Fe'] = 0.5 * data['Ca'] + np.random.normal(0, 0.2, len(dates))
data['Fe'] = data['Fe'].clip(lower=0)
# Add some missing values (randomly distribute ~5% missing values)
for col in data.columns:
    mask = np.random.random(len(data)) < 0.05
    data.loc[mask, col] = np.nan
# Add some below detection limit values
# Define detection limits for each species
detection_limits = {
    'PM10': 1.0,
    'SO4': 0.1,
    'NO3': 0.1,
    'NH4': 0.05,
    'OC': 0.5,
    'EC': 0.2,
    'Na': 0.1,
    'Cl': 0.1,
    'K': 0.05,
    'Ca': 0.05,
    'Mg': 0.02,
    'Fe': 0.05
}
# Replace values below detection limit with a marker
for col, dl in detection_limits.items():
    mask = data[col] < dl
    data.loc[mask, col] = '<QL'
# Display the first few rows of the dataset
data.head()
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 | 69.967142 | 2.021627 | 7.790977 | 1.483097 | 9.183505 | 3.00995 | 0.95686 | 1.258455 | 2.31571 | 0.262077 | 0.059805 | 0.094389 |
| 2020-01-02 | 63.614394 | 3.408994 | 7.149035 | 2.547358 | 9.504222 | 2.856291 | 0.593991 | 0.416258 | 2.386773 | 0.713628 | 0.209851 | 0.068432 |
| 2020-01-03 | 71.465033 | 1.300379 | 7.417699 | 1.542662 | 8.766505 | NaN | 1.393559 | 2.33539 | 2.548034 | 0.873512 | NaN | 0.625942 |
| 2020-01-04 | 80.203635 | 4.035822 | 8.584579 | 3.140606 | 9.427093 | 2.940471 | 1.62358 | 2.021191 | 2.404626 | 0.719014 | 0.092953 | 0.550142 |
| 2020-01-05 | NaN | 3.484446 | 9.662795 | 2.987467 | 12.152089 | 5.194398 | 1.598625 | 2.267092 | 2.620097 | 0.336965 | 0.022993 | 0.326297 |
1.2 Examine the Raw Data¶
Let's examine the raw data to see the distribution of values, including missing and below detection limit values.
detection_limits
{'PM10': 1.0,
'SO4': 0.1,
'NO3': 0.1,
'NH4': 0.05,
'OC': 0.5,
'EC': 0.2,
'Na': 0.1,
'Cl': 0.1,
'K': 0.05,
'Ca': 0.05,
'Mg': 0.02,
'Fe': 0.05}
# Create a PMFPreprocessor instance with our data
preprocessor = PMFPreprocessor(data, ql_values=detection_limits)
# Track values below detection limit
dl_mask = preprocessor.track_quantification_limits()
# Get data quality summary
quality_summary = preprocessor.summarize_data_quality()
display(quality_summary)
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Missing | 5.471956 | 4.103967 | 5.198358 | 5.06156 | 4.924761 | 5.745554 | 5.608755 | 6.019152 | 3.967168 | 4.240766 | 4.103967 | 4.787962 |
| Below QL | 0.136799 | 0.0 | 0.410397 | 0.0 | 0.410397 | 2.735978 | 0.820793 | 0.683995 | 0.0 | 0.547196 | 3.283174 | 8.207934 |
# Visualize missing and below detection limit values
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.heatmap(data.isna(), cmap='viridis', cbar_kws={'label': 'Missing Values'}, yticklabels=False)
plt.title('Missing Values Distribution')
plt.xlabel('Variables')
plt.ylabel('Samples')
plt.subplot(1, 2, 2)
sns.heatmap(dl_mask, cmap='viridis', cbar_kws={'label': 'Below QL Values'}, yticklabels=False)
plt.title('Below Detection Limit Values Distribution')
plt.xlabel('Variables')
plt.tight_layout()
plt.show()
2. Data Preprocessing¶
Now we'll go through the complete preprocessing workflow for PMF analysis:
2.1 Convert to Numeric Values¶
First, we need to convert all values to numeric, replacing below detection limit markers with appropriate numeric values (typically DL/2).
# Convert data to numeric with proper handling of detection limits
data_numeric = preprocessor.convert_to_numeric()
# Display the first few rows to confirm conversion
data_numeric.head()
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 | 69.967142 | 2.021627 | 7.790977 | 1.483097 | 9.183505 | 3.009950 | 0.956860 | 1.258455 | 2.315710 | 0.262077 | 0.059805 | 0.094389 |
| 2020-01-02 | 63.614394 | 3.408994 | 7.149035 | 2.547358 | 9.504222 | 2.856291 | 0.593991 | 0.416258 | 2.386773 | 0.713628 | 0.209851 | 0.068432 |
| 2020-01-03 | 71.465033 | 1.300379 | 7.417699 | 1.542662 | 8.766505 | NaN | 1.393559 | 2.335390 | 2.548034 | 0.873512 | NaN | 0.625942 |
| 2020-01-04 | 80.203635 | 4.035822 | 8.584579 | 3.140606 | 9.427093 | 2.940471 | 1.623580 | 2.021191 | 2.404626 | 0.719014 | 0.092953 | 0.550142 |
| 2020-01-05 | NaN | 3.484446 | 9.662795 | 2.987467 | 12.152089 | 5.194398 | 1.598625 | 2.267092 | 2.620097 | 0.336965 | 0.022993 | 0.326297 |
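The conventional substitution for below-quantification-limit values is DL/2. A minimal pandas sketch of that rule, independent of PMF_toolkits (the helper name is hypothetical):

```python
import numpy as np
import pandas as pd

def replace_below_ql(series: pd.Series, dl: float) -> pd.Series:
    """Replace '<QL' markers with DL/2, the conventional substitution
    for below-detection-limit values, and coerce the result to float."""
    out = series.replace('<QL', dl / 2)
    return pd.to_numeric(out, errors='coerce')

raw = pd.Series([0.3, '<QL', 1.2, np.nan])
print(replace_below_ql(raw, dl=0.1).tolist())  # [0.3, 0.05, 1.2, nan]
```

`convert_to_numeric` handles this (plus per-species limits) internally; the sketch only shows the DL/2 idea for one column.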
2.2 Handle Missing Values (optional)¶
Next, we need to handle missing values using various methods:
# Compare different missing value handling methods
methods = {
    'interpolate': preprocessor.handle_missing_values(method='interpolate', data=data_numeric),
    'mean': preprocessor.handle_missing_values(method='mean', data=data_numeric),
    'median': preprocessor.handle_missing_values(method='median', data=data_numeric),
    'min_fraction': preprocessor.handle_missing_values(method='min_fraction', data=data_numeric, fraction=0.5)
}

# Check if missing values were filled
for method_name, data_filled in methods.items():
    print(f"{method_name}: {data_filled.isna().sum().sum()} missing values remaining")
interpolate: 0 missing values remaining
mean: 0 missing values remaining
median: 0 missing values remaining
min_fraction: 0 missing values remaining
# Let's use interpolation for our analysis
data_filled = methods['interpolate']
# Plot an example to see how different methods filled missing values
fig, ax = plt.subplots(figsize=(12, 6))
# Select a time slice with missing values
missing_mask = data_numeric['PM10'].isna()
if missing_mask.sum() > 0:
    start_idx = missing_mask.idxmax() - pd.Timedelta(days=10)
    end_idx = missing_mask.idxmax() + pd.Timedelta(days=10)

    # Plot raw data with missing values
    ax.plot(data_numeric.loc[start_idx:end_idx, 'PM10'], 'o-', label='Raw Data', alpha=0.7)

    # Plot different filling methods
    for method_name, data_method in methods.items():
        ax.plot(data_method.loc[start_idx:end_idx, 'PM10'], 'o-', label=f'Filled ({method_name})')

    ax.set_title('Comparison of Missing Value Treatment Methods')
    ax.set_ylabel('PM10 Concentration (µg m$^{-3}$)')
    ax.legend()
    plt.tight_layout()
    plt.show()
else:
    print("No missing values in PM10 for demonstration, please adjust the example")
No missing values in PM10 for demonstration, please adjust the example
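The four handling methods map onto simple pandas operations. A self-contained sketch of roughly equivalent logic (the library's exact behaviour may differ; `min_fraction` is interpreted here as filling with a fraction of the column minimum):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled = {
    'interpolate': s.interpolate(),           # linear between neighbours
    'mean': s.fillna(s.mean()),               # column mean
    'median': s.fillna(s.median()),           # column median
    'min_fraction': s.fillna(s.min() * 0.5),  # fraction of the column minimum
}
for name, f in filled.items():
    print(name, f.tolist())
```

Interpolation preserves local trends, which is why it is chosen below; the mean/median fills flatten temporal structure around gaps.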
2.3 Calculate Uncertainties¶
Now we'll compute uncertainties using different methods available in the PMFPreprocessor:
# Calculate uncertainties using different methods
uncertainty_methods = {
    'percentage': preprocessor.compute_uncertainties(
        method='percentage',
        params={'percentage': 0.1}
    ),
    'DL': preprocessor.compute_uncertainties(
        method='DL',
        params={'DL': detection_limits}
    ),
    'polissar': preprocessor.compute_uncertainties(
        method='polissar',
        params={
            'DL': detection_limits,
            'error_fraction': 0.1
        }
    )
}
# Compare uncertainty calculation methods for a specific species
species_to_check = 'SO4'
fig, ax = plt.subplots(figsize=(12, 6))
# Plot the original data
ax.plot(data_filled[species_to_check], 'o-', alpha=0.5, label='Data')
# Plot uncertainties as shaded regions
for method_name, uncertainties in uncertainty_methods.items():
    ax.fill_between(
        data_filled.index,
        data_filled[species_to_check] - uncertainties[species_to_check],
        data_filled[species_to_check] + uncertainties[species_to_check],
        alpha=0.2,
        label=f'Uncertainty ({method_name})'
    )

ax.set_title(f'Comparison of Uncertainty Calculation Methods for {species_to_check}')
ax.set_ylabel(f'{species_to_check} Concentration (µg m$^{-3}$)')
ax.legend()
plt.tight_layout()
plt.show()
The cell raises `TypeError: ufunc 'isfinite' not supported for the input types` inside `ax.fill_between`: at least one of the uncertainty DataFrames still has object (non-numeric) dtype, carried over from the `<QL` markers in the raw data, so matplotlib cannot mask invalid values. Coercing the uncertainties with `pd.to_numeric` (or computing them from the cleaned numeric data) resolves it.
Let's use the Polissar method for our uncertainties, as it's commonly used in PMF studies:
# Select the uncertainties to use
uncertainties = uncertainty_methods['polissar']

# Display the uncertainty summary
uncertainty_summary = uncertainties.describe().T
#uncertainty_summary['relative_uncertainty'] = uncertainty_summary['mean'] / data_filled.mean()
#display(uncertainty_summary[['mean', 'std', 'min', 'max', 'relative_uncertainty']])
uncertainties.describe()
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 691.000000 | 701.000000 | 693.000000 | 694.000000 | 695.000000 | 689.000000 | 690.000000 | 687.000000 | 702.000000 | 700.00000 | 701.000000 | 696.00000 |
| unique | 691.000000 | 701.000000 | 691.000000 | 694.000000 | 693.000000 | 670.000000 | 685.000000 | 683.000000 | 702.000000 | 697.00000 | 678.000000 | 637.00000 |
| top | 6.695845 | 0.202669 | 0.017579 | 0.504351 | 0.087896 | 0.035158 | 0.017579 | 0.017579 | 0.287438 | 0.00879 | 0.003516 | 0.00879 |
| freq | 1.000000 | 1.000000 | 3.000000 | 1.000000 | 3.000000 | 20.000000 | 6.000000 | 5.000000 | 1.000000 | 4.00000 | 24.000000 | 60.00000 |
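For reference, the Polissar et al. (1998) rules are commonly stated as: below-DL values get an uncertainty of (5/6)·DL, detected values get error_fraction·x + DL/3, and missing values are assigned 4× the species mean. A sketch of that rule, not necessarily the exact formula PMF_toolkits implements:

```python
import numpy as np

def polissar_uncertainty(x, dl, error_fraction=0.1):
    """One common statement of the Polissar et al. (1998) rules:
    below DL -> uncertainty = 5/6 * DL
    above DL -> uncertainty = error_fraction * x + DL / 3
    (the missing-value rule is omitted here)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= dl, 5.0 / 6.0 * dl, error_fraction * x + dl / 3.0)

print(polissar_uncertainty([0.05, 2.0], dl=0.1))
```

The key property is that uncertainty never goes to zero near the detection limit, which keeps PMF from over-weighting barely detected values.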
2.4 Calculate Signal-to-Noise Ratio¶
Signal-to-noise ratio is important for deciding which species to include in the PMF analysis:
# Calculate signal-to-noise ratio using the calculate_signal_to_noise method
sn_ratio_df = preprocessor.calculate_signal_to_noise()
# Display the signal-to-noise ratios
print("Signal-to-Noise Ratios:")
print(sn_ratio_df.sort_values('S/N', ascending=False))
# Extract the series for plotting
sn_ratio = sn_ratio_df['S/N']
Signal-to-Noise Ratios:
S/N
PM10 8.507524
SO4 7.587993
NO3 7.489282
Ca 7.483388
OC 7.453301
Cl 7.360436
Na 7.32999
Mg 7.277671
K 7.131311
EC 7.120884
Fe 6.771209
NH4 4.663178
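For comparison, EPA PMF 5.0 computes S/N by summing (x − s)/s over samples whose concentration exceeds the uncertainty and dividing by the number of samples. A sketch of that definition (`calculate_signal_to_noise` may use a different formula):

```python
import numpy as np

def epa_signal_to_noise(conc, unc):
    """EPA PMF 5.0 style S/N: samples where concentration exceeds its
    uncertainty contribute (x - s) / s; all other samples contribute 0."""
    conc = np.asarray(conc, dtype=float)
    unc = np.asarray(unc, dtype=float)
    d = np.where(conc > unc, (conc - unc) / unc, 0.0)
    return d.mean()

print(epa_signal_to_noise([10.0, 5.0, 0.5], [1.0, 1.0, 1.0]))
```

Under this convention species with S/N below 2 are typically downweighted ("weak") and those below about 0.5–1 excluded ("bad"), which is what the threshold lines in the next plot indicate.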
# Use the S/N ratio calculated from calculate_signal_to_noise above
# Check if we have valid S/N ratios before plotting
if len(sn_ratio) > 0 and sn_ratio.sum() > 0:
    # Plot signal-to-noise ratio for all species
    plt.figure(figsize=(12, 6))
    sn_ratio.sort_values().plot(kind='barh')
    plt.axvline(x=2, color='r', linestyle='--', label='Weak (S/N = 2)')
    plt.axvline(x=1, color='r', linestyle='-', label='Bad (S/N = 1)')
    plt.title('Signal-to-Noise Ratio by Species')
    plt.xlabel('Signal-to-Noise Ratio')
    plt.legend()
    plt.tight_layout()
    plt.show()
else:
    print("No valid signal-to-noise ratios were calculated.")
    print("Using a simplified calculation for demonstration purposes:")

    # Create a simplified S/N calculation for demonstration
    data_numeric = preprocessor.convert_to_numeric()
    simple_sn = data_numeric.std() / data_numeric.std().min()

    plt.figure(figsize=(12, 6))
    simple_sn.sort_values().plot(kind='barh')
    plt.axvline(x=2, color='r', linestyle='--', label='Weak (S/N = 2)')
    plt.axvline(x=1, color='r', linestyle='-', label='Bad (S/N = 1)')
    plt.title('Simplified Signal Variability Ratio by Species (Demo)')
    plt.xlabel('Signal Variability Ratio')
    plt.legend()
    plt.tight_layout()
    plt.show()
2.5 Compute Correlation Matrix¶
Let's examine correlations between species:
# Compute correlation matrix
correlation_matrix = preprocessor.compute_correlation_matrix()
# Plot the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Matrix of Species')
plt.tight_layout()
plt.show()
preprocessor.plot_heatmap(cluster=True)
2.6 Select Good Tracer Species¶
Let's identify good tracer species based on correlation and signal-to-noise:
# Select good tracer species
tracers = preprocessor.select_tracers(correlation_threshold=0.3, sn_threshold=3.0)
print(f"Good tracer species: {tracers}")
Good tracer species: ['PM10', 'SO4', 'NO3', 'NH4', 'OC', 'EC', 'Na', 'Cl', 'K', 'Ca', 'Mg', 'Fe']
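The exact criterion used by `select_tracers` isn't shown here; a plausible sketch combining the two thresholds (hypothetical logic, for illustration only) would keep species with adequate S/N that also correlate with at least one other species:

```python
import pandas as pd

def select_tracers_sketch(sn, corr, sn_threshold=3.0, correlation_threshold=0.3):
    """Hypothetical selection rule: keep species whose S/N meets the
    threshold and which correlate with at least one other species
    above the correlation threshold (absolute value)."""
    keep = []
    for sp in sn.index:
        others = corr[sp].drop(sp).abs()
        if sn[sp] >= sn_threshold and (others >= correlation_threshold).any():
            keep.append(sp)
    return keep

sn = pd.Series({'A': 5.0, 'B': 1.0, 'C': 4.0})
corr = pd.DataFrame([[1.0, 0.1, 0.8], [0.1, 1.0, 0.2], [0.8, 0.2, 1.0]],
                    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
print(select_tracers_sketch(sn, corr))  # ['A', 'C']
```

With our synthetic data every species clears both thresholds, which is why all twelve are reported as tracers above.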
2.7 Prepare Data for PMF¶
Now we'll filter the species based on data quality and prepare the final dataset for PMF analysis:
# Filter species based on data quality
data_filtered = preprocessor.filter_species(min_valid=0.75)
print(f"Species retained after filtering: {data_filtered.columns.tolist()}")
# Prepare PMF input in one step
data_normalized, uncertainties_final, metadata = preprocessor.prepare_pmf_input(
    total_var='PM10',
    uncertainty_method='polissar',
    uncertainty_params={'DL': detection_limits, 'error_fraction': 0.1},
    min_valid=0.75,
    handling_method='interpolate'
)

print("\nMetadata from PMF preparation:")
for key, value in metadata.items():
    if key != 'uncertainty_params':
        print(f"  {key}: {value}")
# Display the normalized data
print("\nNormalized data (first 5 rows):")
display(data_normalized.head())
Species retained after filtering: ['PM10', 'SO4', 'NO3', 'NH4', 'OC', 'EC', 'Na', 'Cl', 'K', 'Ca', 'Mg', 'Fe']

Metadata from PMF preparation:
  total_var: PM10
  uncertainty_method: polissar
  min_valid: 0.75
  handling_method: interpolate

Normalized data (first 5 rows):
| | PM10 | SO4 | NO3 | NH4 | OC | EC | Na | Cl | K | Ca | Mg | Fe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-01 | 1.0 | 0.028894 | 0.111352 | 0.021197 | 0.131255 | 0.043019 | 0.013676 | 0.017986 | 0.033097 | 0.003746 | 0.000855 | 0.001349 |
| 2020-01-02 | 1.0 | 0.053588 | 0.112381 | 0.040044 | 0.149404 | 0.044900 | 0.009337 | 0.006543 | 0.037519 | 0.011218 | 0.003299 | 0.001076 |
| 2020-01-03 | 1.0 | 0.018196 | 0.103795 | 0.021586 | 0.122668 | NaN | 0.019500 | 0.032679 | 0.035654 | 0.012223 | NaN | 0.008759 |
| 2020-01-04 | 1.0 | 0.050320 | 0.107035 | 0.039158 | 0.117539 | 0.036663 | 0.020243 | 0.025201 | 0.029982 | 0.008965 | 0.001159 | 0.006859 |
| 2020-01-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
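The PM10 column equal to 1.0 in the output suggests that normalization divides each row by the total variable. A minimal sketch of that step:

```python
import pandas as pd

# Sketch: divide every species by the total variable, row-wise. This is
# consistent with the PM10 column equalling 1.0 in the table above.
df = pd.DataFrame({'PM10': [50.0, 40.0], 'SO4': [5.0, 2.0]})
normalized = df.div(df['PM10'], axis=0)
print(normalized['SO4'].tolist())  # [0.1, 0.05]
```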
3. Exploratory Data Analysis¶
Let's explore our preprocessed data with the visualization tools from PMF_toolkits:
3.1 Time Series Analysis¶
Let's examine time series patterns using the preprocessing plotting functions:
# Plot time series for selected species
fig = preprocessor.plot_timeseries(['PM10', 'SO4', 'NO3'], unit='µg m$^{-3}$')
plt.tight_layout()
plt.show()
# Plot time series with two y-axes to compare different scales
fig = preprocessor.plot_timeseries_2axis('PM10', 'SO4', unit='µg m$^{-3}$', show_equation=True)
plt.tight_layout()
plt.show()
3.2 Regression Analysis¶
Let's look at relationships between species using regression plots:
# Create regression plots between related species
fig = preprocessor.regression_plot('SO4', 'NH4')
plt.tight_layout()
plt.show()
# Try cluster analysis to identify different regimes
fig, results = preprocessor.cluster_analysis('OC', 'EC', n_clusters=[2, 3])
plt.tight_layout()
plt.show()
# Display cluster statistics
for k, cluster_data in results.items():
print(f"\nResults for k={k} clusters:")
for cluster, stats in cluster_data['cluster_stats'].items():
print(f"Cluster {cluster+1}: {stats['count']} points, R2 = {stats['r_squared']:.3f}, slope = {stats['slope']:.3f}")
Results for k=2 clusters:
Cluster 1: 344 points, R2 = 0.278, slope = 0.270
Cluster 2: 312 points, R2 = 0.182, slope = 0.249

Results for k=3 clusters:
Cluster 1: 268 points, R2 = 0.026, slope = 0.126
Cluster 2: 181 points, R2 = 0.089, slope = 0.222
Cluster 3: 207 points, R2 = 0.126, slope = 0.184
3.3 Seasonal Analysis¶
Now let's use the PMF_toolkits utility functions to analyze seasonal patterns:
# Add season information to the data
seasonal_data = add_season(data_filled)
# Calculate seasonal averages
seasonal_averages = seasonal_data.groupby('season').mean()
# Reorder seasons
season_order = ['Winter', 'Spring', 'Summer', 'Fall']
seasonal_averages = seasonal_averages.reindex(season_order)
# Plot seasonal patterns for key species
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
species_to_plot = ['PM10', 'SO4', 'NO3', 'OC']
for i, species in enumerate(species_to_plot):
    seasonal_averages[species].plot(kind='bar', ax=axes[i])
    axes[i].set_title(f'Seasonal Pattern - {species}')
    axes[i].set_ylabel('Concentration (µg m$^{-3}$)')

plt.tight_layout()
plt.show()
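`add_season` is imported from PMF_toolkits.utils; its exact convention isn't shown here, but a minimal month-to-season mapping consistent with the Winter/Spring/Summer/Fall labels above could look like:

```python
import pandas as pd

def month_to_season(month: int) -> str:
    """Map a month number to a meteorological season (DJF = Winter)."""
    return {12: 'Winter', 1: 'Winter', 2: 'Winter',
            3: 'Spring', 4: 'Spring', 5: 'Spring',
            6: 'Summer', 7: 'Summer', 8: 'Summer',
            9: 'Fall', 10: 'Fall', 11: 'Fall'}[month]

idx = pd.date_range('2020-01-01', periods=4, freq='QS')  # Jan, Apr, Jul, Oct starts
print([month_to_season(m) for m in idx.month])  # ['Winter', 'Spring', 'Summer', 'Fall']
```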