Basic Statistics

This file contains basic descriptive statistics functions: the mean, median, mode, moving average, standard deviation, and variance. When a function is called on data, it first checks whether that data type already defines a method of the same name and, if so, uses it.
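The check for an existing method can be pictured with a small plain-Python sketch. The names describe_mean and FixedMean are hypothetical and exist only for this illustration; this is not the Sage source, and it returns a floating-point NaN rather than Sage's symbolic NaN.

```python
import math


def describe_mean(v):
    """Hypothetical stand-in for the dispatch pattern, not the Sage source."""
    # Prefer a method that the data type already provides.
    if hasattr(v, "mean"):
        return v.mean()
    # Otherwise fall back to a generic computation.
    if len(v) == 0:
        return math.nan
    return sum(v) / len(v)


class FixedMean:
    """Toy container (an assumption for illustration) with its own mean()."""

    def mean(self):
        return 1


print(describe_mean(FixedMean()))   # dispatches to FixedMean.mean
print(describe_mean([1, 2, 3, 4]))  # uses the generic fallback
```

The same pattern is what lets the MyClass examples further down return whatever their own method returns.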

The mean() function returns the arithmetic mean (the sum of all the members of a list, divided by the number of members). Future revisions may include the geometric and harmonic means. The median() function returns the value separating the higher half of a sample from the lower half. The mode() function returns the list of the most frequently occurring members of a sample; if several entries are tied for most frequent, all of them are returned, sorted if possible. The moving_average() function is a finite impulse response filter that produces a series of averages over consecutive windows of a user-defined length taken from the full data set. The std() and variance() functions measure how far data points tend to lie from the arithmetic mean.

Functions are available in the namespace stats, i.e. you can use them by typing stats.mean, stats.median, etc.

REMARK: If all the data you are working with are floating point numbers, you may find stats.TimeSeries helpful, since it is extremely fast and offers many of the same descriptive statistics as this module.

AUTHOR:

Andrew Hou (11/06/2009)

sage.stats.basic_stats.mean(v)

Return the mean of the elements of \(v\).

We define the mean of the empty list to be the (symbolic) NaN, following the convention of MATLAB, SciPy, and R.

This function is deprecated. Use numpy.mean() or numpy.nanmean() instead.

INPUT:

v – list of numbers

OUTPUT: a number

EXAMPLES:

Sage

sage: mean([pi, e])                                                             # needs sage.symbolic
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.mean is deprecated;
use numpy.mean or numpy.nanmean instead
See https://github.com/sagemath/sage/issues/29662 for details.
1/2*pi + 1/2*e
sage: mean([])                                                                  # needs sage.symbolic
NaN
sage: mean([I, sqrt(2), 3/5])                                                   # needs sage.symbolic
1/3*sqrt(2) + 1/3*I + 1/5
sage: mean([RIF(1.0103,1.0103), RIF(2)])                                        # needs sage.rings.real_interval_field
1.5051500000000000?
sage: mean(range(4))
3/2
sage: v = stats.TimeSeries([1..100])                                            # needs numpy
sage: mean(v)                                                                   # needs numpy
50.5

Python

>>> from sage.all import *
>>> mean([pi, e])                                                             # needs sage.symbolic
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.mean is deprecated;
use numpy.mean or numpy.nanmean instead
See https://github.com/sagemath/sage/issues/29662 for details.
1/2*pi + 1/2*e
>>> mean([])                                                                  # needs sage.symbolic
NaN
>>> mean([I, sqrt(Integer(2)), Integer(3)/Integer(5)])                                                   # needs sage.symbolic
1/3*sqrt(2) + 1/3*I + 1/5
>>> mean([RIF(RealNumber('1.0103'),RealNumber('1.0103')), RIF(Integer(2))])                                        # needs sage.rings.real_interval_field
1.5051500000000000?
>>> mean(range(Integer(4)))
3/2
>>> v = stats.TimeSeries((ellipsis_range(Integer(1),Ellipsis,Integer(100))))                                            # needs numpy
>>> mean(v)                                                                   # needs numpy
50.5
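As a rough illustration of the mean() convention described above, the following plain-Python sketch keeps rational inputs exact with fractions.Fraction. The name mean_sketch is hypothetical, the input is assumed to consist of integers or rationals, and a floating-point NaN stands in for Sage's symbolic NaN.

```python
from fractions import Fraction
import math


def mean_sketch(v):
    """Arithmetic mean of integers or rationals; NaN for empty input."""
    values = list(v)
    if not values:
        # Assumption: float NaN replaces the symbolic NaN used by Sage.
        return math.nan
    # Fractions keep the result exact instead of rounding to a float.
    return sum(Fraction(x) for x in values) / len(values)


print(mean_sketch(range(4)))  # 3/2
```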

sage.stats.basic_stats.median(v)

Return the median (middle value) of the elements of \(v\).

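If \(v\) is empty, we define the median to be NaN, which is consistent with NumPy (note that R returns NULL). If \(v\) consists of an even number of strings, a TypeError occurs, because the two middle values cannot be averaged. For an odd number of elements other than numbers, the median is the middle element of sorted(v), as the string example below shows.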

This function is deprecated. Use numpy.median() or numpy.nanmedian() instead.

INPUT:

v – list

OUTPUT: median element of \(v\)

EXAMPLES:

Sage

sage: median([1,2,3,4,5])
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.median is deprecated;
use numpy.median or numpy.nanmedian instead
See https://github.com/sagemath/sage/issues/29662 for details.
3
sage: median([e, pi])                                                           # needs sage.symbolic
1/2*pi + 1/2*e
sage: median(['sage', 'linux', 'python'])
'python'
sage: median([])                                                                # needs sage.symbolic
NaN
sage: class MyClass:
....:    def median(self):
....:       return 1
sage: stats.median(MyClass())
1

Python

>>> from sage.all import *
>>> median([Integer(1),Integer(2),Integer(3),Integer(4),Integer(5)])
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.median is deprecated;
use numpy.median or numpy.nanmedian instead
See https://github.com/sagemath/sage/issues/29662 for details.
3
>>> median([e, pi])                                                           # needs sage.symbolic
1/2*pi + 1/2*e
>>> median(['sage', 'linux', 'python'])
'python'
>>> median([])                                                                # needs sage.symbolic
NaN
>>> class MyClass:
...    def median(self):
...       return Integer(1)
>>> stats.median(MyClass())
1
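The median() behaviour on sorted input, including the string case, can be sketched in plain Python. The name median_sketch is hypothetical, and a floating-point NaN stands in for Sage's symbolic NaN; this is an illustration under those assumptions rather than the Sage implementation.

```python
from fractions import Fraction
import math


def median_sketch(v):
    """Middle value of sorted(v); averages the two middle values if needed."""
    values = sorted(v)
    n = len(values)
    if n == 0:
        return math.nan
    mid = n // 2
    if n % 2 == 1:
        # Odd length: no arithmetic is needed, so any sortable type works.
        return values[mid]
    # Even length: average the two middle values (fails for strings).
    return (values[mid - 1] + values[mid]) / Fraction(2)


print(median_sketch([1, 2, 3, 4, 5]))
print(median_sketch(["sage", "linux", "python"]))
try:
    median_sketch(["sage", "linux"])
except TypeError:
    print("TypeError for an even number of strings")
```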

sage.stats.basic_stats.mode(v)

Return the mode of \(v\).

The mode is the list of the most frequently occurring elements in \(v\). If \(n\) is the most times that any element occurs in \(v\), then the mode is the list of elements of \(v\) that occur \(n\) times. The list is sorted if possible.

This function is deprecated. Use scipy.stats.mode() or statistics.mode() instead.

Note

The elements of \(v\) must be hashable.

INPUT:

v – list

OUTPUT: list (sorted if possible)

EXAMPLES:

Sage

sage: v = [1,2,4,1,6,2,6,7,1]
sage: mode(v)
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.mode is deprecated;
use scipy.stats.mode or statistics.mode instead
See https://github.com/sagemath/sage/issues/29662 for details.
[1]
sage: v.count(1)
3
sage: mode([])
[]

sage: mode([1,2,3,4,5])
[1, 2, 3, 4, 5]
sage: mode([3,1,2,1,2,3])
[1, 2, 3]
sage: mode([0, 2, 7, 7, 13, 20, 2, 13])
[2, 7, 13]

sage: mode(['sage', 'four', 'I', 'three', 'sage', 'pi'])
['sage']

sage: class MyClass:
....:   def mode(self):
....:       return [1]
sage: stats.mode(MyClass())
[1]

Python

>>> from sage.all import *
>>> v = [Integer(1),Integer(2),Integer(4),Integer(1),Integer(6),Integer(2),Integer(6),Integer(7),Integer(1)]
>>> mode(v)
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.mode is deprecated;
use scipy.stats.mode or statistics.mode instead
See https://github.com/sagemath/sage/issues/29662 for details.
[1]
>>> v.count(Integer(1))
3
>>> mode([])
[]

>>> mode([Integer(1),Integer(2),Integer(3),Integer(4),Integer(5)])
[1, 2, 3, 4, 5]
>>> mode([Integer(3),Integer(1),Integer(2),Integer(1),Integer(2),Integer(3)])
[1, 2, 3]
>>> mode([Integer(0), Integer(2), Integer(7), Integer(7), Integer(13), Integer(20), Integer(2), Integer(13)])
[2, 7, 13]

>>> mode(['sage', 'four', 'I', 'three', 'sage', 'pi'])
['sage']

>>> class MyClass:
...   def mode(self):
...       return [Integer(1)]
>>> stats.mode(MyClass())
[1]
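The definition of mode() given above (all elements that attain the maximal count, sorted if possible) can be sketched with collections.Counter. The name mode_sketch is hypothetical; like the original, it requires hashable elements.

```python
from collections import Counter


def mode_sketch(v):
    """List of the most frequently occurring elements, sorted if possible."""
    counts = Counter(v)
    if not counts:
        return []
    top = max(counts.values())
    modes = [x for x, c in counts.items() if c == top]
    try:
        return sorted(modes)
    except TypeError:
        # Elements that cannot be compared are returned unsorted.
        return modes


print(mode_sketch([0, 2, 7, 7, 13, 20, 2, 13]))  # [2, 7, 13]
```

Note that statistics.mode() from the Python standard library returns a single value, while statistics.multimode() returns a list of all tied values, which is closer to the behaviour shown here.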

sage.stats.basic_stats.moving_average(v, n)

Return the moving average of a list \(v\).

The moving average of a list is often used to smooth out noisy data.

If \(v\) is empty, the result is empty as well: moving_average([], 1) returns [], so there are no entries to define.

This function is deprecated. Use pandas.Series.rolling() instead.

INPUT:

v – list
n – the number of values used in computing each average

OUTPUT: list of length len(v)-n+1, since we do not fabricate any values

EXAMPLES:

Sage

sage: moving_average([1..10], 1)
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.moving_average is deprecated;
use pandas.Series.rolling instead
See https://github.com/sagemath/sage/issues/29662 for details.
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sage: moving_average([1..10], 4)
[5/2, 7/2, 9/2, 11/2, 13/2, 15/2, 17/2]
sage: moving_average([], 1)
[]
sage: moving_average([pi, e, I, sqrt(2), 3/5], 2)                               # needs sage.symbolic
[1/2*pi + 1/2*e, 1/2*e + 1/2*I, 1/2*sqrt(2) + 1/2*I,
 1/2*sqrt(2) + 3/10]

Python

>>> from sage.all import *
>>> moving_average((ellipsis_range(Integer(1),Ellipsis,Integer(10))), Integer(1))
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.moving_average is deprecated;
use pandas.Series.rolling instead
See https://github.com/sagemath/sage/issues/29662 for details.
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> moving_average((ellipsis_range(Integer(1),Ellipsis,Integer(10))), Integer(4))
[5/2, 7/2, 9/2, 11/2, 13/2, 15/2, 17/2]
>>> moving_average([], Integer(1))
[]
>>> moving_average([pi, e, I, sqrt(Integer(2)), Integer(3)/Integer(5)], Integer(2))                               # needs sage.symbolic
[1/2*pi + 1/2*e, 1/2*e + 1/2*I, 1/2*sqrt(2) + 1/2*I,
 1/2*sqrt(2) + 3/10]

We check whether the input is a time series and, if so, use the optimized simple_moving_average() method, but with the (slightly different) meaning defined above. The point is that simple_moving_average() on a time series returns as many values as the series has entries, whereas this function returns len(v)-n+1 values:

Sage

sage: a = stats.TimeSeries([1..10])                                             # needs numpy
sage: stats.moving_average(a, 3)                                                # needs numpy
[2.0000, 3.0000, 4.0000, 5.0000, 6.0000, 7.0000, 8.0000, 9.0000]
sage: stats.moving_average(list(a), 3)                                          # needs numpy
[2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

Python

>>> from sage.all import *
>>> a = stats.TimeSeries((ellipsis_range(Integer(1),Ellipsis,Integer(10))))                                             # needs numpy
>>> stats.moving_average(a, Integer(3))                                                # needs numpy
[2.0000, 3.0000, 4.0000, 5.0000, 6.0000, 7.0000, 8.0000, 9.0000]
>>> stats.moving_average(list(a), Integer(3))                                          # needs numpy
[2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
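The list-based moving_average() described above, with its output of length len(v)-n+1, can be sketched in plain Python using a running window sum. The name moving_average_sketch is hypothetical, the input is assumed to be integers or rationals with n at least 1, and returning an empty list when n exceeds len(v) is a choice made for this sketch only.

```python
from fractions import Fraction


def moving_average_sketch(v, n):
    """Averages over each run of n consecutive values (n >= 1 assumed)."""
    values = [Fraction(x) for x in v]
    if not values or n > len(values):
        # Assumption of this sketch: no full window means no output.
        return []
    window = sum(values[:n])
    averages = [window / n]
    for i in range(n, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        window += values[i] - values[i - n]
        averages.append(window / n)
    return averages


result = moving_average_sketch(range(1, 11), 4)
print([str(x) for x in result])
print(len(result) == 10 - 4 + 1)
```

Updating the window sum incrementally keeps the cost linear in len(v) instead of recomputing each window from scratch.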

sage.stats.basic_stats.std(v, bias=False)

Return the standard deviation of the elements of \(v\).

We define the standard deviation of the empty list to be NaN, following the convention of MATLAB, SciPy, and R.

This function is deprecated. Use numpy.std() or numpy.nanstd() instead.

INPUT:

v – list of numbers
bias – boolean (default: False); if False, divide by len(v) - 1 instead of len(v) to give a less biased (sample) estimator of the standard deviation.

OUTPUT: a number

EXAMPLES:

Sage

sage: # needs sage.symbolic
sage: std([1..6], bias=True)
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.std is deprecated;
use numpy.std or numpy.nanstd instead
See https://github.com/sagemath/sage/issues/29662 for details.
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.variance is deprecated;
use numpy.var or numpy.nanvar instead
See https://github.com/sagemath/sage/issues/29662 for details.
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.mean is deprecated;
use numpy.mean or numpy.nanmean instead
See https://github.com/sagemath/sage/issues/29662 for details.
1/2*sqrt(35/3)
sage: std([1..6], bias=False)
sqrt(7/2)
sage: std([e, pi])
sqrt(1/2)*abs(pi - e)
sage: std([])
NaN
sage: std([I, sqrt(2), 3/5])
1/15*sqrt(1/2)*sqrt((10*sqrt(2) - 5*I - 3)^2
+ (5*sqrt(2) - 10*I + 3)^2 + (5*sqrt(2) + 5*I - 6)^2)
sage: std([RIF(1.0103, 1.0103), RIF(2)])
0.6998235813403261?

sage: # needs numpy
sage: import numpy
sage: if int(numpy.version.short_version[0]) > 1:
....:     _ = numpy.set_printoptions(legacy="1.25")
sage: x = numpy.array([1,2,3,4,5])
sage: std(x, bias=False)
1.5811388300841898
sage: x = stats.TimeSeries([1..100])
sage: std(x)
29.011491975882016

Python

>>> from sage.all import *
>>> # needs sage.symbolic
>>> std((ellipsis_range(Integer(1),Ellipsis,Integer(6))), bias=True)
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.std is deprecated;
use numpy.std or numpy.nanstd instead
See https://github.com/sagemath/sage/issues/29662 for details.
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.variance is deprecated;
use numpy.var or numpy.nanvar instead
See https://github.com/sagemath/sage/issues/29662 for details.
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.mean is deprecated;
use numpy.mean or numpy.nanmean instead
See https://github.com/sagemath/sage/issues/29662 for details.
1/2*sqrt(35/3)
>>> std((ellipsis_range(Integer(1),Ellipsis,Integer(6))), bias=False)
sqrt(7/2)
>>> std([e, pi])
sqrt(1/2)*abs(pi - e)
>>> std([])
NaN
>>> std([I, sqrt(Integer(2)), Integer(3)/Integer(5)])
1/15*sqrt(1/2)*sqrt((10*sqrt(2) - 5*I - 3)^2
+ (5*sqrt(2) - 10*I + 3)^2 + (5*sqrt(2) + 5*I - 6)^2)
>>> std([RIF(RealNumber('1.0103'), RealNumber('1.0103')), RIF(Integer(2))])
0.6998235813403261?

>>> # needs numpy
>>> import numpy
>>> if int(numpy.version.short_version[Integer(0)]) > Integer(1):
...     _ = numpy.set_printoptions(legacy="1.25")
>>> x = numpy.array([Integer(1),Integer(2),Integer(3),Integer(4),Integer(5)])
>>> std(x, bias=False)
1.5811388300841898
>>> x = stats.TimeSeries((ellipsis_range(Integer(1),Ellipsis,Integer(100))))
>>> std(x)
29.011491975882016
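For numeric (non-symbolic) input, the effect of the bias flag of std() can be sketched with the Python standard library, where statistics.stdev() divides by len(v) - 1 and statistics.pstdev() divides by len(v). The name std_sketch and the mapping of bias onto these two functions are assumptions made for illustration; results are floats, not exact symbolic expressions.

```python
import math
import statistics


def std_sketch(v, bias=False):
    """Standard deviation as a float; NaN for empty input."""
    values = list(v)
    if not values:
        return math.nan
    if bias:
        # Population form: divide by len(values).
        return statistics.pstdev(values)
    # Sample form: divide by len(values) - 1 (needs at least two values).
    return statistics.stdev(values)


print(std_sketch([1, 2, 3, 4, 5], bias=False))
print(std_sketch(range(1, 7), bias=True))
```

When migrating to numpy.std(), keep in mind that it divides by len(v) by default (ddof=0), so matching bias=False requires passing ddof=1.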

sage.stats.basic_stats.variance(v, bias=False)

Return the variance of the elements of \(v\).

We define the variance of the empty list to be NaN, following the convention of MATLAB, SciPy, and R.

This function is deprecated. Use numpy.var() or numpy.nanvar() instead.

INPUT:

v – list of numbers
bias – boolean (default: False); if False, divide by len(v) - 1 instead of len(v) to give a less biased (sample) estimator of the variance.

OUTPUT: a number

EXAMPLES:

Sage

sage: variance([1..6])
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.variance is deprecated;
use numpy.var or numpy.nanvar instead
See https://github.com/sagemath/sage/issues/29662 for details.
7/2
sage: variance([1..6], bias=True)
35/12
sage: variance([e, pi])                                                         # needs sage.symbolic
1/2*(pi - e)^2
sage: variance([])
NaN
sage: variance([I, sqrt(2), 3/5])                                               # needs sage.symbolic
1/450*(10*sqrt(2) - 5*I - 3)^2 + 1/450*(5*sqrt(2) - 10*I + 3)^2
+ 1/450*(5*sqrt(2) + 5*I - 6)^2
sage: variance([RIF(1.0103, 1.0103), RIF(2)])
0.4897530450000000?
sage: import numpy                                                              # needs numpy
sage: if int(numpy.version.short_version[0]) > 1:                               # needs numpy
....:     _ = numpy.set_printoptions(legacy="1.25")                                 # needs numpy
sage: x = numpy.array([1,2,3,4,5])                                              # needs numpy
sage: variance(x, bias=False)                                                   # needs numpy
2.5
sage: x = stats.TimeSeries([1..100])
sage: variance(x)
841.6666666666666
sage: variance(x, bias=True)
833.25
sage: class MyClass:
....:   def variance(self, bias=False):
....:      return 1
sage: stats.variance(MyClass())
1
sage: class SillyPythonList:
....:   def __init__(self):
....:       self.__list = [2, 4]
....:   def __len__(self):
....:       return len(self.__list)
....:   def __iter__(self):
....:       return self.__list.__iter__()
....:   def mean(self):
....:       return 3
sage: R = SillyPythonList()
sage: variance(R)
2
sage: variance(R, bias=True)
1

Python

>>> from sage.all import *
>>> variance((ellipsis_range(Integer(1),Ellipsis,Integer(6))))
doctest:warning...
DeprecationWarning: sage.stats.basic_stats.variance is deprecated;
use numpy.var or numpy.nanvar instead
See https://github.com/sagemath/sage/issues/29662 for details.
7/2
>>> variance((ellipsis_range(Integer(1),Ellipsis,Integer(6))), bias=True)
35/12
>>> variance([e, pi])                                                         # needs sage.symbolic
1/2*(pi - e)^2
>>> variance([])
NaN
>>> variance([I, sqrt(Integer(2)), Integer(3)/Integer(5)])                                               # needs sage.symbolic
1/450*(10*sqrt(2) - 5*I - 3)^2 + 1/450*(5*sqrt(2) - 10*I + 3)^2
+ 1/450*(5*sqrt(2) + 5*I - 6)^2
>>> variance([RIF(RealNumber('1.0103'), RealNumber('1.0103')), RIF(Integer(2))])
0.4897530450000000?
>>> import numpy                                                              # needs numpy
>>> if int(numpy.version.short_version[Integer(0)]) > Integer(1):                               # needs numpy
...     _ = numpy.set_printoptions(legacy="1.25")                                 # needs numpy
>>> x = numpy.array([Integer(1),Integer(2),Integer(3),Integer(4),Integer(5)])                                              # needs numpy
>>> variance(x, bias=False)                                                   # needs numpy
2.5
>>> x = stats.TimeSeries((ellipsis_range(Integer(1),Ellipsis,Integer(100))))
>>> variance(x)
841.6666666666666
>>> variance(x, bias=True)
833.25
>>> class MyClass:
...   def variance(self, bias=False):
...      return Integer(1)
>>> stats.variance(MyClass())
1
>>> class SillyPythonList:
...   def __init__(self):
...       self.__list = [Integer(2), Integer(4)]
...   def __len__(self):
...       return len(self.__list)
...   def __iter__(self):
...       return self.__list.__iter__()
...   def mean(self):
...       return Integer(3)
>>> R = SillyPythonList()
>>> variance(R)
2
>>> variance(R, bias=True)
1
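The role of the bias parameter of variance() can be made explicit with a short exact computation. The sketch below is plain Python, the name variance_sketch is hypothetical, the input is assumed to be integers or rationals, and a floating-point NaN stands in for Sage's symbolic NaN.

```python
from fractions import Fraction
import math


def variance_sketch(v, bias=False):
    """Exact variance of integers or rationals; NaN for empty input."""
    values = [Fraction(x) for x in v]
    n = len(values)
    if n == 0:
        return math.nan
    mu = sum(values) / n
    squared_deviations = sum((x - mu) ** 2 for x in values)
    # bias=True divides by n; bias=False divides by n - 1, which
    # raises ZeroDivisionError for a single value in this sketch.
    divisor = n if bias else n - 1
    return squared_deviations / divisor


print(variance_sketch(range(1, 7)))             # 7/2
print(variance_sketch(range(1, 7), bias=True))  # 35/12
```

For the two-element data [2, 4] used in the SillyPythonList example, the same computation gives 2 with bias=False and 1 with bias=True.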