Arrays are dense, contiguous, uniformly sized blocks of identically typed data values.
import numpy as np
L = [[0,1],[2,3]]
A = np.array(L)
L: [[0, 1], [2, 3]] A: [[0 1] [2 3]]
<class 'list'> <class 'numpy.ndarray'>
In the standard Python interpreter (CPython), the return value of id
is the memory address of the object.
print(id(L[1])-id(L[0])) #rows are far away
print(id(L[0][1])-id(L[0][0])) #columns not so much, but 32 bytes?
Keeping data close together results in faster access times.
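As a rough illustration of what "close together" means for an array, the sketch below inspects its memory layout. The dtype is pinned to np.int64 only so that the byte counts are the same on every platform; that is an assumption of this example.

```python
import numpy as np

A = np.array([[0, 1], [2, 3]], dtype=np.int64)

print(A.itemsize)               # bytes per element: 8 for int64
print(A.strides)                # bytes to step along each axis: (16, 8)
print(A.flags['C_CONTIGUOUS'])  # True: rows are stored back to back
print(A.nbytes)                 # size of the whole data block: 32
```

Moving one column over is an 8 byte step and moving one row down is a 16 byte step, so the four values sit in a single 32 byte block.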
Arrays
Note that np.ndarray
is the array type, and np.array
is the function normally used to create one.
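A quick check of how the two names relate: np.array is called like a function, and the object it returns has type np.ndarray.

```python
import numpy as np

A = np.array([1, 2, 3])

print(type(A))                       # <class 'numpy.ndarray'>
print(type(A) is np.ndarray)         # True
print(isinstance(np.ndarray, type))  # True: ndarray is a class
print(isinstance(np.array, type))    # False: array is a function
print(callable(np.array))            # True
```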
A = np.array([1,2,3,4])
A.dtype #type of what is stored in the array - NOT python types!
A.ndim #number of dimensions (axes in numpy speak)
A.shape #size of the dimensions as a tuple
A.reshape((4,1)).shape #a column vector
(4, 1)
A = np.array([1,2,3,4]).reshape(4,1)
#can initialize an array with a list, or list of lists (or list of lists of lists, etc)
M = np.array([[1,2,3],[4,5,6.0]])
float64 (2, 3)
#if you know the size, but not the data, you can initialize to zeros:
Z = np.zeros((10,10))
#or ones
O = np.ones((5,10))
#or identity
I = np.identity(3) #this makes a 3x3 square identity matrix
print(Z.dtype) #note, default type is floating point
Z = np.zeros((10,10),np.int64) #can change
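A short sketch of how the element type gets chosen and changed. The variable names here are only for illustration.

```python
import numpy as np

mixed = np.array([1, 2, 3.0])          # one float promotes every element
print(mixed.dtype)                     # float64

Z = np.zeros((2, 3))                   # default dtype is float64
Zi = np.zeros((2, 3), dtype=np.int64)  # dtype can be given by keyword
print(Z.dtype, Zi.dtype)               # float64 int64

# astype returns a new, converted array; the original is unchanged
F = np.array([1.7, 2.2, -0.5])
print(F.astype(np.int64))              # [1 2 0]  (truncates toward zero)
print(F.dtype)                         # float64
```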
Arrays can be indexed and sliced much like Python lists, but they take a tuple of indices, one per dimension.
M = np.array([[0,1,2],[3,4,5]])
array([[0, 1, 2], [3, 4, 5]])
print(M[1,1]) #indexing
print(M[0,-1]) #last item of first row
4 2
print(M[0,1:]) #can have slices - all but first column of first row
[1 2]
print(M[1],M[1,:]) #missing indices are treated as complete slices
[3 4 5] [3 4 5]
M = [[0,1,2],[3,4,5]]
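One difference worth seeing directly: a nested Python list does not accept a tuple index, so the same expression that works on an array raises a TypeError on a list. The sketch below keeps the list and the array under separate names (L and M) to make the contrast clear.

```python
import numpy as np

L = [[0, 1, 2], [3, 4, 5]]
M = np.array(L)

print(M[1, 1])      # 4
print(L[1][1])      # 4: lists need one bracket pair per level

try:
    L[1, 1]         # tuple index on a list
except TypeError as err:
    print("list indexing failed:", err)

print(M[:, 1])      # [1 4]: a whole column in one step
col = [row[1] for row in L]   # the list equivalent needs a loop
print(col)          # [1, 4]
```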
Arrays support advanced indexing by arrays of integers or Booleans:
A = np.array([0,1,4,9,16,25])
print(A[[2,5]]) #choose just indices 2 and 5
[ 4 25]
Indexing by Boolean numpy arrays can be used to select elements
b = A > 4
[False False False True True True]
[ 9 16 25]
print("b =",b)
A[b] = 0
b = [False False False True True True]
[0 1 4 0 0 0]
S = np.array(['a','b','c','b','a'])
S[S != 'a'] = 'z'
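Boolean masks can also be combined. A minimal sketch: use & (and), | (or), and ~ (not), with parentheses around each comparison because these operators bind more tightly than the comparisons.

```python
import numpy as np

A = np.array([0, 1, 4, 9, 16, 25])

mid = A[(A > 1) & (A < 20)]     # both conditions must hold
print(mid)                      # [ 4  9 16]

print(np.count_nonzero(A > 4))  # 3: how many elements the mask selects

S = np.array(['a', 'b', 'c', 'b', 'a'])
S[S != 'a'] = 'z'               # assign through a mask
print(S)                        # ['a' 'z' 'z' 'z' 'a']
```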
An array object has a pointer to a dense block of memory that stores the data of the array.
A = np.array([[0,1,2],[3,4,5],[6,7,8]])
B = A #A and B reference the _same_ object
A is B
B[0,0] = 1000
array([[1000, 1, 2], [ 3, 4, 5], [ 6, 7, 8]])
row = A[1,:]
array([3, 4, 5])
row[2] = 5000
array([[1000, 1, 2], [ 3, 4, 5000], [ 6, 7, 8]])
newMat = A.copy() #this will actually copy the data
newMat[0,0] = 0
array([[1000, 1, 2], [ 3, 4, 5000], [ 6, 7, 8]])
array([[ 0, 1, 2], [ 3, 4, 5000], [ 6, 7, 8]])
A = np.array([[0,1,2],[3,4,5],[6,7,8]])
B = A[A > 4]
array([5, 6, 7, 8])
B[:] = -1
array([-1, -1, -1, -1])
array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
A[A > 4] = -1
array([[ 0, 1, 2], [ 3, 4, -1], [-1, -1, -1]])
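The behavior above comes down to views versus copies: a basic slice is a view onto the same memory, while Boolean (advanced) indexing and .copy() produce new data. np.shares_memory is one way to check which case applies.

```python
import numpy as np

A = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

row = A[1, :]        # basic slice: a view
picked = A[A > 4]    # Boolean indexing: a copy
dup = A.copy()       # explicit copy

print(np.shares_memory(A, row))     # True
print(np.shares_memory(A, picked))  # False
print(np.shares_memory(A, dup))     # False
print(row.base is A)                # True: the view remembers its owner
```

Note that A[A > 4] = -1 still modifies A, because indexing on the left side of an assignment writes into A directly instead of creating a copy first.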
def z(M):
    M[:] = 0
A = np.array([1,2,3])
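Arrays are passed to functions by reference, so slice assignment inside a function changes the caller's data, while rebinding the parameter name does not. The two function names below are hypothetical and exist only for this comparison.

```python
import numpy as np

def zero_in_place(M):
    M[:] = 0                # writes into the caller's data

def rebind_only(M):
    M = np.zeros_like(M)    # rebinds the local name; caller's array untouched

A = np.array([1, 2, 3])

rebind_only(A)
print(A)                    # [1 2 3]

zero_in_place(A)
print(A)                    # [0 0 0]
```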
numpy includes a number of standard functions that will work on arrays
A = [1,2,3,4]
array([ 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
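These functions apply elementwise and accept plain lists as well as arrays. A small sketch comparing np.sin with a loop over math.sin:

```python
import math
import numpy as np

A = [1, 2, 3, 4]

elementwise = np.sin(A)     # the list is converted to an array first
looped = np.array([math.sin(x) for x in A])

print(np.allclose(elementwise, looped))  # True
print(type(elementwise))                 # <class 'numpy.ndarray'>
print(np.sqrt(np.array([1, 4, 9])))      # [1. 2. 3.]
```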
Most aggregation operations take an axis
parameter that limits the operation to a specific direction in the array
b = np.arange(12).reshape(3,4); b
array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
array([12, 15, 18, 21])
array([ 6, 22, 38])
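The two results above are consistent with summing b along axis 0 and axis 1. A sketch of how the axis argument changes the result:

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

print(b.sum())          # 66: no axis, reduce over everything
print(b.sum(axis=0))    # [12 15 18 21]: collapse the rows, one value per column
print(b.sum(axis=1))    # [ 6 22 38]: collapse the columns, one value per row
print(b.mean(axis=1))   # [1.5 5.5 9.5]
print(b.max(axis=0))    # [ 8  9 10 11]
```

The axis you name is the one that disappears from the shape: a (3, 4) array reduced over axis=0 has shape (4,).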
genfromtxt (and the simpler loadtxt) will read in delimited files.
array([nan, nan, nan, ..., nan, nan, nan])
The default delimiter is whitespace, which will not work with a CSV file.
array([[ nan, 4.0000000e+01, 5.0000000e+01, ..., 2.4000000e+02, 2.5000000e+02, 2.6000000e+02], [ nan, -7.0000000e-02, -2.3000000e-01, ..., 5.7000000e-01, 0.0000000e+00, 1.0000000e-02], [ nan, 2.1500000e-01, 9.0000000e-02, ..., -1.0000000e-01, 2.7000000e-01, 2.3500001e-01], ..., [ nan, -2.5500000e-01, -3.6000000e-01, ..., 8.4000000e-01, -3.9000000e-01, -4.1500000e-01], [ nan, 5.7000000e-01, 1.2000000e-01, ..., -1.2000000e-01, 6.9000000e-01, 5.5500000e-01], [ nan, 4.0500000e-01, 1.7000000e-01, ..., -8.0000000e-02, 6.5000000e-01, 5.2000000e-01]])
Why nan?
Recall that numpy arrays are dense, uniformly typed arrays. Can't mix a gene name (string) with expression values (float).
strdata = np.genfromtxt('../files/Spellman.csv',dtype=str,delimiter=',')
array([['time', '40', '50', ..., '240', '250', '260'], ['YAL001C', '-0.07', '-0.23', ..., '0.57', '0', '0.01'], ['YAL014C', '0.215', '0.09', ..., '-0.1', '0.27', '0.23500001'], ..., ['YPR201W', '-0.255', '-0.36', ..., '0.84', '-0.39', '-0.415'], ['YPR203W', '0.57', '0.12', ..., '-0.12', '0.69', '0.555'], ['YPR204W', '0.405', '0.17', ..., '-0.08', '0.65', '0.52']], dtype='<U12')
header = strdata[0,1:].astype(int)
genes = strdata[1:,0]
values = strdata[1:,1:].astype(float)
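The same steps can be tried without the data file. The sketch below reads a tiny made-up table from an in-memory string; the gene names and numbers are placeholders for illustration, not rows from Spellman.csv.

```python
import io
import numpy as np

# Made-up stand-in for the real file
text = """time,40,50,60
GENE_A,-0.07,-0.23,0.57
GENE_B,0.215,0.09,-0.1
"""

# Default float dtype: anything that is not a number becomes nan
as_float = np.genfromtxt(io.StringIO(text), delimiter=',')
print(np.isnan(as_float[:, 0]).all())   # True: the name column is all nan

# Read everything as strings, then convert the numeric parts
strdata = np.genfromtxt(io.StringIO(text), dtype=str, delimiter=',')
header = strdata[0, 1:].astype(int)
genes = strdata[1:, 0]
values = strdata[1:, 1:].astype(float)

print(header)        # [40 50 60]
print(genes)         # ['GENE_A' 'GENE_B']
print(values.shape)  # (2, 3)
```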
(4382, 24)
array([[0.51995439, 0.50171038, 0.51653364, ..., 0.59293044, 0.52793615, 0.5290764 ], [0.55245154, 0.5381984 , 0.53078677, ..., 0.51653364, 0.55872292, 0.55473204], [0.54503991, 0.54503991, 0.55302166, ..., 0.48916762, 0.55644242, 0.54960091], ..., [0.49885975, 0.48688712, 0.49372862, ..., 0.62371722, 0.48346636, 0.48061574], [0.59293044, 0.54161916, 0.51995439, ..., 0.51425314, 0.60661345, 0.59122007], [0.57411631, 0.54732041, 0.52280502, ..., 0.51881414, 0.60205245, 0.58722919]])
import matplotlib.pyplot as plt
#bins = [-3,-2,-1,0,1,2,3]
#bins = np.linspace(-3,3,100)
bins = np.linspace(-3,3,100)
plt.hist(values[:,0],bins=bins, alpha=0.5,label="ts-40")
plt.xlabel("Expression", size=14)
plt.ylabel("Number of Instances", size=14)
Text(0, 0.5, 'Number of Instances')
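To see the numbers behind a histogram without drawing it, np.histogram does the binning on its own. The sketch uses synthetic normally distributed data as a stand-in for an expression column; note that np.linspace(-3,3,100) gives 100 bin edges, which means 99 bins.

```python
import numpy as np

rng = np.random.default_rng(0)               # synthetic data, fixed seed
sample = rng.normal(0.0, 1.0, size=1000)

bins = np.linspace(-3, 3, 100)               # 100 edges define 99 bins
counts, edges = np.histogram(sample, bins=bins)

print(len(edges), len(counts))               # 100 99
print(counts.sum() <= sample.size)           # True: values outside [-3, 3] are not counted
```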
plt.ylabel("Avg. Expression",size=14);
plt.ylabel("Avg. Expression",size=14)
<matplotlib.legend.Legend at 0x120b7e3d0>