Detecting behavioural patterns in bike sharing data

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from tlviz.factor_tools import factor_match_score
from wordcloud import WordCloud

import matcouply.decomposition as decomposition
from matcouply.data import get_bike_data

Load the data

This dataset contains three matrices with bike sharing data from three cities in Norway: Oslo, Bergen and Trondheim. Each row of these data matrices represents a station, and each column represents an hour in 2021. The matrix element \(x^{(\text{Oslo})}_{jk}\) is the number of trips that ended at station \(j\) in Oslo during hour \(k\). More information about this dataset is available in the documentation for the get_bike_data function.

bike_data = get_bike_data()
matrices = [bike_data["oslo"].values, bike_data["bergen"].values, bike_data["trondheim"].values]
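
Before modelling, it can be useful to quickly inspect the structure of the loaded data. The snippet below is a small sketch that prints the shape of each matrix together with the first few row and column labels; it assumes that each city entry in bike_data is a pandas DataFrame (as the .values calls above suggest).

# Quick inspection sketch: stations along the rows, hours of 2021 along the columns
for city in ["oslo", "bergen", "trondheim"]:
    df = bike_data[city]
    print(city, df.shape)            # (number of stations, number of hours)
    print(df.index[:3].tolist())     # first few station labels
    print(df.columns[:3].tolist())   # first few time stamps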

Fit non-negative PARAFAC2 models

Let us fit a non-negative PARAFAC2 model to these matrices to extract underlying patterns. We fit five models using different random initializations to avoid bad local minima and to check whether the model is unique.

all_models = []
all_errors = []
lowest_error = float("inf")
for init in range(5):
    print("-" * 50)
    print("Init:", init)
    print("-" * 50)
    cmf, diagnostics = decomposition.parafac2_aoadmm(
        matrices,
        rank=4,
        non_negative=True,
        n_iter_max=1000,
        tol=1e-8,
        verbose=100,
        return_errors=True,
        random_state=init,
    )

    all_models.append(cmf)
    all_errors.append(diagnostics)

    if diagnostics.regularized_loss[-1] < lowest_error:
        selected_init = init
        lowest_error = diagnostics.regularized_loss[-1]
--------------------------------------------------
Init: 0
--------------------------------------------------
All regularization penalties (including regs list):
* Mode 0:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 1:
   - <'matcouply.penalties.Parafac2' with svd='truncated_svd', update_basis_matrices=True, update_coordinate_matrix=True, n_iter=1, aux_init='random_uniform', dual_init='random_uniform')>
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 2:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
Feasibility gaps for A: [0.4054142384386044]
Feasibility gaps for the Bi-matrices: [0.9925100259153429, 0.7276658585981951]
Feasibility gaps for C: [0.7170566554644632]
Coupled matrix factorization iteration=0, reconstruction error=0.6518931179399983, regularized loss=0.21248231860876624 regularized loss variation=0.48763329828884333.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [0.6621001482716514, 0.02760828424897867]
Feasibility gaps for C: [0.012295957425270494]
Coupled matrix factorization iteration=100, reconstruction error=0.5673400427458556, regularized loss=0.16093736205143463 regularized loss variation=5.892840827241757e-07.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [9.807297293820137e-06, 4.365206124640187e-06]
Feasibility gaps for C: [1.3403174764560529e-06]
converged in 168 iterations: FEASIBILITY GAP CRITERION AND RELATIVE LOSS CRITERION SATISFIED
--------------------------------------------------
Init: 1
--------------------------------------------------
All regularization penalties (including regs list):
* Mode 0:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 1:
   - <'matcouply.penalties.Parafac2' with svd='truncated_svd', update_basis_matrices=True, update_coordinate_matrix=True, n_iter=1, aux_init='random_uniform', dual_init='random_uniform')>
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 2:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
Feasibility gaps for A: [0.78131763755016]
Feasibility gaps for the Bi-matrices: [0.992680309603253, 0.711831264805992]
Feasibility gaps for C: [0.709439159799734]
Coupled matrix factorization iteration=0, reconstruction error=0.6513487710909589, regularized loss=0.21212761080085119 regularized loss variation=0.5220086265006298.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [0.7545453652188518, 0.0900027536673507]
Feasibility gaps for C: [0.009140236391788749]
Coupled matrix factorization iteration=100, reconstruction error=0.5673514287605087, regularized loss=0.16094382185829528 regularized loss variation=1.8190117930732827e-06.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [9.3256059171422e-06, 6.0745486213299965e-06]
Feasibility gaps for C: [1.6908877395345425e-06]
Coupled matrix factorization iteration=200, reconstruction error=0.5673379372643657, regularized loss=0.16093616752969264 regularized loss variation=4.9253821556644514e-08.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [1.0702565501160036e-06, 2.6578544375650817e-06]
Feasibility gaps for C: [1.978462431413455e-07]
converged in 251 iterations: FEASIBILITY GAP CRITERION AND RELATIVE LOSS CRITERION SATISFIED
--------------------------------------------------
Init: 2
--------------------------------------------------
All regularization penalties (including regs list):
* Mode 0:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 1:
   - <'matcouply.penalties.Parafac2' with svd='truncated_svd', update_basis_matrices=True, update_coordinate_matrix=True, n_iter=1, aux_init='random_uniform', dual_init='random_uniform')>
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 2:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
Feasibility gaps for A: [0.713209392366304]
Feasibility gaps for the Bi-matrices: [0.9909529246397856, 0.7054941909790023]
Feasibility gaps for C: [0.7091226634926251]
Coupled matrix factorization iteration=0, reconstruction error=0.6505893059459343, regularized loss=0.2116332225056063 regularized loss variation=0.5107231608991845.
Feasibility gaps for A: [0.012519804405783517]
Feasibility gaps for the Bi-matrices: [0.7981803831414409, 0.12632738679422534]
Feasibility gaps for C: [0.01147257808862832]
Coupled matrix factorization iteration=100, reconstruction error=0.5674970973146409, regularized loss=0.1610264777302715 regularized loss variation=1.0119606410311852e-05.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [1.1741786796722072e-05, 7.643068180559676e-06]
Feasibility gaps for C: [1.3835742937975192e-06]
Coupled matrix factorization iteration=200, reconstruction error=0.5673473747485936, regularized loss=0.1609415218170606 regularized loss variation=1.2227243194832198e-06.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [1.979154329422416e-06, 1.1488952184562974e-06]
Feasibility gaps for C: [1.9304172928321133e-07]
Coupled matrix factorization iteration=300, reconstruction error=0.56733780249434, regularized loss=0.16093609106955334 regularized loss variation=3.651255862454094e-08.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [2.980157290520802e-07, 1.0590357857431017e-07]
Feasibility gaps for C: [3.085371945683839e-08]
converged in 339 iterations: FEASIBILITY GAP CRITERION AND RELATIVE LOSS CRITERION SATISFIED
--------------------------------------------------
Init: 3
--------------------------------------------------
All regularization penalties (including regs list):
* Mode 0:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 1:
   - <'matcouply.penalties.Parafac2' with svd='truncated_svd', update_basis_matrices=True, update_coordinate_matrix=True, n_iter=1, aux_init='random_uniform', dual_init='random_uniform')>
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 2:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
Feasibility gaps for A: [0.9239105363240266]
Feasibility gaps for the Bi-matrices: [0.99109144334499, 0.690871237429179]
Feasibility gaps for C: [0.7153978251491855]
Coupled matrix factorization iteration=0, reconstruction error=0.6578576883681536, regularized loss=0.21638836907254536 regularized loss variation=0.48298208895897954.
Feasibility gaps for A: [0.010502419497484258]
Feasibility gaps for the Bi-matrices: [0.09354342528809674, 0.0020292742657619343]
Feasibility gaps for C: [0.007600687406894959]
Coupled matrix factorization iteration=100, reconstruction error=0.5673406616990883, regularized loss=0.16093771320857966 regularized loss variation=8.346546199445429e-07.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [1.2157971724506902e-05, 5.830196975929153e-06]
Feasibility gaps for C: [2.116393215990601e-06]
converged in 164 iterations: FEASIBILITY GAP CRITERION AND RELATIVE LOSS CRITERION SATISFIED
--------------------------------------------------
Init: 4
--------------------------------------------------
All regularization penalties (including regs list):
* Mode 0:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 1:
   - <'matcouply.penalties.Parafac2' with svd='truncated_svd', update_basis_matrices=True, update_coordinate_matrix=True, n_iter=1, aux_init='random_uniform', dual_init='random_uniform')>
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
* Mode 2:
   - <'matcouply.penalties.NonNegativity' with aux_init='random_uniform', dual_init='random_uniform')>
Feasibility gaps for A: [0.7573629294044325]
Feasibility gaps for the Bi-matrices: [0.9935377332354612, 0.7411712941151378]
Feasibility gaps for C: [0.712528179098723]
Coupled matrix factorization iteration=0, reconstruction error=0.6507929020365915, regularized loss=0.21176570067060427 regularized loss variation=0.48062691596101664.
Feasibility gaps for A: [0.0019361917733547059]
Feasibility gaps for the Bi-matrices: [0.328119396835454, 0.009570067648664015]
Feasibility gaps for C: [0.010342676439388174]
Coupled matrix factorization iteration=100, reconstruction error=0.5673410497016232, regularized loss=0.16093793333826986 regularized loss variation=8.09225624033223e-07.
Feasibility gaps for A: [0.0]
Feasibility gaps for the Bi-matrices: [1.0926848717201373e-05, 5.547561109678054e-06]
Feasibility gaps for C: [1.830175272255011e-06]
converged in 171 iterations: FEASIBILITY GAP CRITERION AND RELATIVE LOSS CRITERION SATISFIED
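
Since we stored the diagnostics for every run, we can also plot the loss curves to compare how the different initializations converge. This is an optional sketch that only relies on the regularized_loss attribute already used in the loop above.

# Plot the regularized loss per iteration for each initialization (log scale)
for init, diagnostics in enumerate(all_errors):
    plt.semilogy(diagnostics.regularized_loss, label=f"Init {init}")
plt.xlabel("Iteration")
plt.ylabel("Regularized loss")
plt.legend()
plt.show()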

Check uniqueness of the NN-PARAFAC2 models

To check that the model is unique, we verify that the initialization runs that reach the same loss also recover the same components. We do this by stacking the B_i matrices so that each model can be treated as a single CP tensor, and then computing the factor match score (FMS) against the selected (lowest-loss) run.

def get_stacked_CP_tensor(cmf):
    weights, factors = cmf
    A, B_is, C = factors

    stacked_cp_tensor = (weights, (A, np.concatenate(B_is, axis=0), C))
    return stacked_cp_tensor


print("Similarity with selected init")
for init, model in enumerate(all_models):
    if init == selected_init:
        print(f"Init {init} is the selected init")
        continue

    fms = factor_match_score(
        get_stacked_CP_tensor(model), get_stacked_CP_tensor(all_models[selected_init]), consider_weights=False
    )
    print(f"Similarity with selected init: {fms:}")


weights, (A, B_is, C) = all_models[selected_init]
Similarity with selected init
Similarity with selected init: 0.9999992387502665
Similarity with selected init: 0.9999933862814362
Similarity with selected init: 0.9999908284278795
Init 3 is the selected init
Similarity with selected init: 0.99999959738201
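
All runs reach a factor match score very close to one, which indicates that they recover essentially the same components. As an optional extra check, the sketch below computes the full pairwise FMS matrix between all initializations, reusing get_stacked_CP_tensor and factor_match_score from above.

# Pairwise factor match scores between all initializations (sanity-check sketch)
n_models = len(all_models)
fms_matrix = np.ones((n_models, n_models))
for i in range(n_models):
    for j in range(i + 1, n_models):
        fms_matrix[i, j] = fms_matrix[j, i] = factor_match_score(
            get_stacked_CP_tensor(all_models[i]),
            get_stacked_CP_tensor(all_models[j]),
            consider_weights=False,
        )
print(np.round(fms_matrix, 4))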

Convert factor matrices to DataFrame

To make visualization easier, we convert the factor matrices to dataframes with interpretable indices. We also sort the components by their weighted norm and clip the factors at zero, since the AO-ADMM algorithm may leave slightly negative values that are numerically very close to zero.

if weights is None:
    weights = 1

norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B_is[0], axis=0) * np.linalg.norm(C, axis=0)
order = np.argsort(-weights * norms)

A = pd.DataFrame(np.maximum(0, A[:, order]), index=["Oslo", "Bergen", "Trondheim"])
B_is = [
    pd.DataFrame(np.maximum(0, B_is[0][:, order]), index=bike_data["oslo"].index),
    pd.DataFrame(np.maximum(0, B_is[1][:, order]), index=bike_data["bergen"].index),
    pd.DataFrame(np.maximum(0, B_is[2][:, order]), index=bike_data["trondheim"].index),
]
C = pd.DataFrame(np.maximum(0, C[:, order]), index=bike_data["oslo"].columns)

Plot the time-components

C_melted = C.melt(value_name="Value", var_name="Component", ignore_index=False).reset_index()
fig = px.line(
    C_melted,
    x="Time of arrival",
    y="Value",
    facet_row="Component",
    hover_data={"Time of arrival": "|%a, %b %e, %H:%M"},
    color="Component",
)
fig


Briefly looking at the time-mode components, we immediately see that the fourth component captures behaviour during the summer, when people in Norway typically have their vacation. If we zoom in a bit, we can see interesting behaviour for the first three components too. They are most active on weekdays. The first component likely represents travel home from work, as it is active in the afternoon, and the second component likely represents travel to work, as it is active in the morning. The third component, however, is active the whole day, but mostly in the afternoon and the morning. Interestingly, the 'vacation' component is most active during weekends rather than weekdays.
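
To back this interpretation up with numbers, we can average each time-mode component by hour of day and by day of week. The sketch below assumes that the index of C (the arrival times) is a pandas DatetimeIndex, as the datetime hover-formatting above suggests.

# Average component activation by hour of day and by day of week
# (assumes C.index is a DatetimeIndex; dayofweek uses 0 = Monday)
hourly_profile = C.groupby(C.index.hour).mean()
weekday_profile = C.groupby(C.index.dayofweek).mean()
print(hourly_profile.round(3))
print(weekday_profile.round(3))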

Plot the strength of the components in each city

A_melted = A.melt(value_name="Value", var_name="Component", ignore_index=False).reset_index()
A_melted["Component"] = A_melted["Component"].astype(str)  # Force discrete colormap
fig = px.bar(A_melted, x="index", y="Value", facet_row="Component", color="Component")
fig


We see that most of the components are most prominent in Oslo (which is also the largest city), except for the third component, which is mainly prominent in Bergen.
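
Since Oslo is also the largest of the three cities, its raw component strengths may partly reflect the overall trip volume. As a small hedged sketch, normalizing each row of A to sum to one shows the relative mix of components within each city instead.

# Relative component strength within each city (each row of A normalized to sum to 1)
A_relative = A.div(A.sum(axis=1), axis=0)
print(A_relative.round(2))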

Plot the Oslo-station components as a density-map

We can visualize the station components as a density map by first joining the station-mode factor matrix for each city with a dataframe containing the station coordinates, and then using the density_mapbox plot from Plotly Express.

B_0_melted = (
    B_is[0]
    .join(bike_data["station_metadata"])
    .melt(value_name="Value", var_name="Component", ignore_index=False, id_vars=bike_data["station_metadata"].columns)
    .reset_index()
)

fig = px.density_mapbox(
    B_0_melted,
    lat="Arrival station latitude",
    lon="Arrival station longitude",
    z="Value",
    zoom=11,
    opacity=0.5,
    animation_frame="Component",
    animation_group="Arrival station ID",
    hover_data=["Arrival station name"],
    title="Oslo",
)
fig.update_layout(
    mapbox_style="carto-positron",
)
fig


By exploring the map, you can see that the first component (active at the end of workdays) is active in residential areas fairly close to the city centre. This pattern is expected, as people living in these areas are the most likely to have a bike-sharing station nearby and a short enough commute to bike home from work. Furthermore, the second component (active at the beginning of workdays) is active in more central, high-density areas where offices and universities are located. The third component's activation (active during the whole day) is spread throughout the city. Finally, the fourth component (active during weekends in the summer) has activation for stations close to popular swimming areas and areas with many restaurants with outdoor seating.

Plot the Bergen-station components as a density-map

B_1_melted = (
    B_is[1]
    .join(bike_data["station_metadata"])
    .melt(value_name="Value", var_name="Component", ignore_index=False, id_vars=bike_data["station_metadata"].columns)
    .reset_index()
)

fig = px.density_mapbox(
    B_1_melted,
    lat="Arrival station latitude",
    lon="Arrival station longitude",
    z="Value",
    zoom=11,
    opacity=0.5,
    animation_frame="Component",
    animation_group="Arrival station ID",
    hover_data=["Arrival station name"],
    title="Bergen",
)
fig.update_layout(
    mapbox_style="carto-positron",
)
fig


Again, we see that the first component (active at the end of workdays) is active in residential areas near the city centre. The second component (active at the beginning of workdays) is also clearly active near offices and the universities. The third component's activation (active during the whole day) is spread throughout the city and the residential areas. Finally, the fourth component (active during weekends in the summer) has activation for stations close to popular swimming areas, parks and restaurants with outdoor seating.

Plot the Trondheim-station components as a density-map

B_2_melted = (
    B_is[2]
    .join(bike_data["station_metadata"])
    .melt(value_name="Value", var_name="Component", ignore_index=False, id_vars=bike_data["station_metadata"].columns)
    .reset_index()
)

fig = px.density_mapbox(
    B_2_melted,
    lat="Arrival station latitude",
    lon="Arrival station longitude",
    z="Value",
    zoom=11,
    opacity=0.5,
    animation_frame="Component",
    animation_group="Arrival station ID",
    hover_data=["Arrival station name"],
    title="Trondheim",
)
fig.update_layout(
    mapbox_style="carto-positron",
)
fig


Here, we see the same story as for Oslo and Bergen: component one is active near residential areas, component two near offices and universities, component three is active throughout the city, and component four is active in areas that are popular during the summer.

Plot the station components as word-clouds

n_components = B_is[0].shape[1]
B_is_with_meta = [B_i.join(bike_data["station_metadata"]) for B_i in B_is]

fig, axes = plt.subplots(n_components, 3, figsize=(10, 2 * n_components), dpi=200, squeeze=False)

for r in range(n_components):
    for city_id in range(3):
        wc = WordCloud(background_color="black", max_words=1000, colormap="Pastel1")
        frequencies = B_is_with_meta[city_id].set_index("Arrival station name")[r].to_dict()
        wc.generate_from_frequencies(frequencies)
        axes[r, city_id].imshow(wc, interpolation="bilinear")
        axes[r, city_id].set_xticks([])
        axes[r, city_id].set_yticks([])

axes[0, 0].set_title("Oslo")
axes[0, 1].set_title("Bergen")
axes[0, 2].set_title("Trondheim")
for r in range(n_components):
    axes[r, 0].set_ylabel(f"Component {r}")
plt.show()
[Word clouds of the station components, one row per component; columns: Oslo, Bergen, Trondheim]

These word-cloud plots confirm the patterns we see on the maps. Stations such as “Bankplassen”, “Nygårdsporten” and “Vollabekken” are close to high-density areas with many workplaces in Oslo, Bergen and Trondheim, respectively, while stations like “Rådhusbrygge 4”, “Festplassen” and “Lade idrettsannlegg vest” are close to popular summer activities.
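
As a numeric complement to the word clouds, the sketch below lists the stations with the highest loading for each component in each city, reusing the B_is_with_meta dataframes defined above.

# Top-5 stations per component and city, ranked by component loading (sketch)
for city, B_i_meta in zip(["Oslo", "Bergen", "Trondheim"], B_is_with_meta):
    for r in range(n_components):
        top_stations = B_i_meta.nlargest(5, r)["Arrival station name"]
        print(f"{city}, component {r}: " + ", ".join(top_stations))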
