Example datasets

Functions:

get_bike_data()

Get bike sharing data from three major Norwegian cities

get_semiconductor_etch_machine_data([...])

Load machine measurements from the semiconductor etch dataset from [WGB+99].

get_semiconductor_etch_raw_data([...])

Load semiconductor etch data from [WGB+99].

get_simple_simulated_data([noise_level, ...])

Generate a simple simulated dataset with shifting unimodal \(\mathbf{B}^{(i)}\) matrices.

matcouply.data.get_bike_data()[source]

Get bike sharing data from three major Norwegian cities

This dataset contains three matrices with bike sharing data from Oslo, Bergen and Trondheim, \(\mathbf{X}^{(\text{Oslo})}, \mathbf{X}^{(\text{Bergen})}\) and \(\mathbf{X}^{(\text{Trondheim})}\). Each row of these data matrices represents a station, and each column represents an hour in 2021. The matrix element \(x^{(\text{Oslo})}_{jk}\) is the number of trips that ended at station \(j\) in Oslo during hour \(k\).

The data was obtained on the 23rd of November 2021 using the open APIs of the bike sharing services in the three cities.

The dataset is cleaned so it only contains data for the dates in 2021 when bike sharing was open in all three cities (2021-04-07 to 2021-11-23).

Returns:

  • dict – Dictionary mapping each city name to a data frame containing bike sharing data from that city. There is also an additional "station_metadata" key, which maps to a data frame with additional station metadata. This metadata is useful for interpreting the extracted components.

Note: The original bike sharing data is released under the NLOD license (https://data.norge.no/nlod/en/2.0/).

matcouply.data.get_semiconductor_etch_machine_data(download_data=True, save_data=True)[source]

Load machine measurements from the semiconductor etch dataset from [WGB+99].

This function will load the semiconductor etch machine data and prepare it for analysis.

If the dataset is already downloaded on your computer, then the local files will be loaded. Otherwise, they will be downloaded. By default, the files are downloaded from https://eigenvector.com/data/Etch.

Parameters:
  • download_data (bool) – If False, then an error will be raised if the data is not already downloaded.

  • save_data (bool) – If True, then the data will be stored locally to avoid having to download it multiple times.

Returns:

Dictionary where the keys are the dataset names and the values are the contents of the MATLAB files.

Return type:

dict

matcouply.data.get_semiconductor_etch_raw_data(download_data=True, save_data=True)[source]

Load semiconductor etch data from [WGB+99].

If the dataset is already downloaded on your computer, then the local files will be loaded. Otherwise, they will be downloaded. By default, the files are downloaded from https://eigenvector.com/data/Etch.

Parameters:
  • download_data (bool) – If False, then an error will be raised if the data is not already downloaded.

  • save_data (bool) – If True, then the data will be stored locally to avoid having to download it multiple times.

Returns:

Dictionary where the keys are the dataset names and the values are the contents of the MATLAB files.

Return type:

dict

matcouply.data.get_simple_simulated_data(noise_level=0.2, random_state=1)[source]

Generate a simple simulated dataset with shifting unimodal \(\mathbf{B}^{(i)}\) matrices.

The entries in \(\mathbf{A}\) (or \(\mathbf{D}^{(i)}\)-matrices) are uniformly distributed between 0.1 and 1.1. This is done to ensure that there is signal from all components in all matrices.

The component vectors in the \(\mathbf{B}^{(i)}\) matrices are Gaussian probability density functions that shift one entry for each matrix. This means that they are non-negative, unimodal and satisfy the PARAFAC2 constraint.
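Such shifting unimodal components can be sketched with NumPy as follows. This is an illustrative re-implementation, not the generator used internally by get_simple_simulated_data; the function name and the choice of bump centres are assumptions for the example.

```python
import numpy as np

def shifting_unimodal_B(num_matrices, J, rank, width=3.0):
    """Generate B^(i) matrices whose columns are Gaussian bumps that
    shift down by one entry for each matrix (illustrative sketch)."""
    t = np.arange(J)
    B_is = []
    for i in range(num_matrices):
        # Each component r gets a Gaussian bump centred at a
        # component-specific offset, shifted by i for matrix i.
        centres = np.linspace(0, J // 2, rank) + i
        B = np.exp(-0.5 * ((t[:, None] - centres[None, :]) / width) ** 2)
        B_is.append(B / np.sqrt(2 * np.pi * width**2))
    return B_is

B_is = shifting_unimodal_B(num_matrices=5, J=20, rank=3)
```

Because every column is a sampled Gaussian density, the matrices are non-negative and each component vector is unimodal, with its peak moving one entry per matrix.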

The entries in \(\mathbf{C}\) follow a truncated normal distribution and are therefore sparse.

The dataset is generated by constructing the matrices represented by the decomposition and adding noise according to

\[\mathbf{M}_\text{noisy}^{(i)} = \mathbf{M}^{(i)} + \eta \frac{\|\mathbf{\mathcal{M}}\|}{\|\mathbf{\mathcal{N}}\|} \mathbf{N}^{(i)},\]

where \(\eta\) is the noise level, \(\mathbf{M}^{(i)}\) is the \(i\)-th matrix represented by the simulated factorization, \(\mathbf{\mathcal{M}}\) is the tensor obtained by stacking all the \(\mathbf{M}^{(i)}\)-matrices, \(n^{(i)}_{jk} \sim \mathcal{N}(0, 1)\), and \(\mathbf{N}^{(i)}\) and \(\mathbf{\mathcal{N}}\) are the matrix and tensor with elements given by \(n^{(i)}_{jk}\), respectively.
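The noise model above can be sketched in NumPy as follows (an illustrative re-implementation of the equation, not the library's exact code; it assumes all matrices have the same shape so they can be stacked):

```python
import numpy as np

def add_noise(matrices, noise_level, rng):
    """Add noise following M_noisy^(i) = M^(i) + eta * (||M|| / ||N||) * N^(i),
    where ||.|| is the Frobenius norm of the stacked tensor."""
    M = np.stack(matrices)             # tensor of all M^(i)
    N = rng.standard_normal(M.shape)   # n_{jk}^{(i)} ~ N(0, 1)
    scale = noise_level * np.linalg.norm(M) / np.linalg.norm(N)
    return [M_i + scale * N_i for M_i, N_i in zip(M, N)]

rng = np.random.default_rng(0)
clean = [rng.random((10, 4)) for _ in range(3)]
noisy = add_noise(clean, noise_level=0.2, rng=rng)
```

By construction, the Frobenius norm of the added noise tensor equals exactly \(\eta\) times the norm of the clean tensor, so noise_level controls the relative noise strength directly.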

Parameters:
  • noise_level (float) – Strength of the noise added to the matrices.

  • random_state (None, int or valid tensorly random state) –

Returns:

  • list of matrices – The noisy matrices

  • CoupledMatrixFactorization – The factorization that underlies the simulated data matrices