Wednesday, November 25, 2020

Seaborn data visualization misunderstanding of densities?

I was playing around with the seaborn library for data visualization and trying to display a standard normal distribution. The basics in this case look something like:

import numpy as np
import seaborn as sns

n=1000
N= np.random.randn(n)
fig=sns.displot(N,kind="kde")

This behaves as expected. My problem starts when I try to plot multiple distributions at the same time. I tried the brute-force approach, N_2 = np.random.randn(n//2) followed by fig = sns.displot((N, N_2), kind="kde"), which returns two distributions (as wanted), but the one with the smaller sample size looks significantly different (and flatter). Regardless of the sample size, a proper density plot (or normalized histogram) should have an area under the curve equal to one, but this is clearly not the case here.

Knowing that seaborn works with pandas DataFrames, I then tried the more elaborate (and generally bad and inefficient, but hopefully clear) code below to plot multiple distributions on the same graph:

import numpy as np
import seaborn as sns
import pandas as pd
n=10000

N_1= np.reshape(np.random.randn(n),(n,1))
N_2= np.reshape(np.random.randn(int(n/2)),(int(n/2),1))
N_3= np.reshape(np.random.randn(int(n/4)),(int(n/4),1))

A_1 = np.reshape(np.array(['n1' for _ in range(n)]),(n,1))
A_2 = np.reshape(np.array(['n2' for _ in range(int(n/2))]),(int(n/2),1))
A_3 = np.reshape(np.array(['n3' for _ in range(int(n/4))]),(int(n/4),1))

F_1=np.concatenate((N_1,A_1),1)
F_2=np.concatenate((N_2,A_2),1)
F_3=np.concatenate((N_3,A_3),1)

F= pd.DataFrame(data=np.concatenate((F_1,F_2,F_3),0),columns=["datar","cat"])
F["datar"]=F.datar.astype('float')
fig=sns.displot(F,x="datar",hue="cat",kind="kde")

This again shows very different (almost scaled) distributions, so the result is not consistent with what I was expecting (namely, roughly overlapping distributions). Am I misunderstanding how this graph works? Is there a completely different approach to drawing multiple distributions on the same graph that I am missing?
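
One possible explanation, sketched here with plain NumPy histograms standing in for the KDE: if the groups are normalized jointly rather than one by one, each curve's area equals that group's share of all observations, so smaller samples look flatter. The helper `areas` and its `common_norm` argument are hypothetical names chosen for this illustration (the argument mirrors the `common_norm` parameter documented for seaborn's `kdeplot`); this is a sketch of the assumed mechanism, not seaborn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Three standard normal samples with the same sizes as in the question.
samples = {
    "n1": rng.standard_normal(n),
    "n2": rng.standard_normal(n // 2),
    "n3": rng.standard_normal(n // 4),
}

# Shared bins spanning all of the data, so every observation is counted.
all_values = np.concatenate(list(samples.values()))
bins = np.linspace(all_values.min(), all_values.max(), 101)
width = np.diff(bins)
total = len(all_values)


def areas(common_norm):
    """Return the area under each group's histogram density (hypothetical helper)."""
    result = {}
    for name, x in samples.items():
        counts, _ = np.histogram(x, bins=bins)
        # Joint normalization divides by the size of all groups together;
        # per-group normalization divides by the group's own size.
        denom = total if common_norm else len(x)
        density = counts / (denom * width)
        result[name] = float(np.sum(density * width))
    return result


joint_areas = areas(common_norm=True)      # areas sum to one across groups
separate_areas = areas(common_norm=False)  # each area is one on its own
print(joint_areas)
print(separate_areas)
```

Under joint normalization the three areas are the sample-size shares 4/7, 2/7 and 1/7, which matches the "almost scaled" look described above. If that is indeed what happens here, passing common_norm=False to the displot call with hue="cat" should presumably give roughly overlapping curves.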



