Electronic Medical Records with Pandas¶

09/28/2023¶

print view
notebook

No description has been provided for this image
In [1]:
%%html
<script src="https://bits.csb.pitt.edu/preamble.js"></script>

https://www.npr.org/sections/health-shots/2015/10/30/452853785/medical-students-crunch-big-data-to-spot-health-trends

The Data¶

Download this (and, optionally, unzip it): https://bits.csb.pitt.edu/files/Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz

Go here: https://navigator.med.nyu.edu/ace/sparcs/

In [2]:
f = open('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv')
i = 0
for line in f:
    row = line.split(',')
    print(len(row))
    i += 1
    if i > 10:
        break
37
38
37
38
38
40
37
37
38
37
37

What's wrong? Why (look at the data)? How do we fix it?

Possible Fix¶

Use the python csv module to read the file.

In [3]:
import csv
f = open('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv')
reader = csv.reader(f)
i = 0
for row in reader:
    print(len(row))
    i += 1
    if i > 10:
        break
37
37
37
37
37
37
37
37
37
37
37

or...

pandas¶

https://pandas.pydata.org/index.html

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It can also read and store compressed data automatically, which is a big deal.

Basically, if you have a big spreadsheet of data with mixed types.

In [4]:
#usecols only extracts those columns - much faster
import pandas as pd
data = pd.read_csv('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz', usecols=[3,7,10,14,15,16,17,29,30,35])
/var/folders/s6/r0hv0w_j5pnbjrpqcsn9sq8m0000gp/T/ipykernel_32823/2813859563.py:3: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz', usecols=[3,7,10,14,15,16,17,29,30,35])

Note: This is a large dataset so I am preemptively selecting a subset of columns. Try to avoid loading this data more than once as it may take up all your memory and you'll have to restart the python kernel

In [5]:
data
Out[5]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
0 37.0 F 8 197 Skin and subcutaneous tissue infections 0 NO PROC 90335341.0 NaN $9546.85
1 37.0 F 3 146 Diverticulosis and diverticulitis 0 NO PROC 90335341.0 NaN $11462.75
2 37.0 M 1 50 Diabetes mellitus with complications 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $1609.40
3 37.0 F 1 154 Noninfectious gastroenteritis 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $2638.75
4 37.0 F 3 124 Acute and chronic tonsillitis 0 NO PROC 90335341.0 NaN $3538.25
... ... ... ... ... ... ... ... ... ... ...
2367278 943.0 F 1 245 Syncope 0 NO PROC 156102.0 NaN $10074.00
2367279 943.0 M 1 149 Biliary tract disease 0 NO PROC 267443.0 NaN $21252.00
2367280 943.0 F 1 102 Nonspecific chest pain 0 NO PROC 267443.0 NaN $11673.00
2367281 943.0 F 1 660 Alcohol-related disorders 171 SUTURE SKIN/SUBCUT TISS 267443.0 249597.0 $16722.00
2367282 943.0 M 1 2 Septicemia (except in labor) 0 NO PROC 251080.0 NaN $11637.00

2367283 rows × 10 columns

In [7]:
data.to_numpy()
Out[7]:
array([[37.0, 'F', '8', ..., 90335341.0, nan, '$9546.85'],
       [37.0, 'F', '3', ..., 90335341.0, nan, '$11462.75'],
       [37.0, 'M', '1', ..., 90335341.0, 167816.0, '$1609.40'],
       ...,
       [943.0, 'F', '1', ..., 267443.0, nan, '$11673.00'],
       [943.0, 'F', '1', ..., 267443.0, 249597.0, '$16722.00'],
       [943.0, 'M', '1', ..., 251080.0, nan, '$11637.00']], dtype=object)

Data Access¶

Columns (and rows) have names that you can use to access them.

In [8]:
data.columns
Out[8]:
Index(['Facility ID', 'Gender', 'Length of Stay', 'CCS Diagnosis Code',
       'CCS Diagnosis Description', 'CCS Procedure Code',
       'CCS Procedure Description', 'Attending Provider License Number',
       'Operating Provider License Number', 'Total Charges'],
      dtype='object')
In [9]:
data.Gender  #get's gender column
Out[9]:
0          F
1          F
2          M
3          F
4          F
          ..
2367278    F
2367279    M
2367280    F
2367281    F
2367282    M
Name: Gender, Length: 2367283, dtype: object

Data Access¶

In [10]:
data["CCS Diagnosis Code"][:3] # can't use dot syntax if column name has spaces
Out[10]:
0    197
1    146
2     50
Name: CCS Diagnosis Code, dtype: int64

[] slices by rows, but indexes by column name - must provide range or it interprets the index as a column label.

In [11]:
data[:1]
Out[11]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
0 37.0 F 8 197 Skin and subcutaneous tissue infections 0 NO PROC 90335341.0 NaN $9546.85
In [12]:
data[0] 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[12], line 1
----> 1 data[0] 

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/frame.py:3896, in DataFrame.__getitem__(self, key)
   3894 if self.columns.nlevels > 1:
   3895     return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
   3897 if is_integer(indexer):
   3898     indexer = [indexer]

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 0

iloc: Position indexing¶

If you want to reference a pandas data frame with position based indexing, use .iloc - work's just like numpy

In [13]:
data.iloc[3,1]
Out[13]:
'F'
In [14]:
data.iloc[0,:]
Out[14]:
Facility ID                                                             37.0
Gender                                                                     F
Length of Stay                                                             8
CCS Diagnosis Code                                                       197
CCS Diagnosis Description            Skin and subcutaneous tissue infections
CCS Procedure Code                                                         0
CCS Procedure Description                                            NO PROC
Attending Provider License Number                                 90335341.0
Operating Provider License Number                                        NaN
Total Charges                                                       $9546.85
Name: 0, dtype: object

Pandas uses NaN to indicate missing data

In [15]:
%%html
<div id="pdiloc" style="width: 500px"></div>
<script>
$('head').append('<link rel="stylesheet" href="https://bits.csb.pitt.edu/asker.js/themes/asker.default.css" />');

    var divid = '#pdiloc';
	jQuery(divid).asker({
	    id: divid,
	    question: "What does data.iloc[0] return?",
		answers: ["First row",'First column','37','Facility ID','Error'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>
In [16]:
data.iloc[0]
Out[16]:
Facility ID                                                             37.0
Gender                                                                     F
Length of Stay                                                             8
CCS Diagnosis Code                                                       197
CCS Diagnosis Description            Skin and subcutaneous tissue infections
CCS Procedure Code                                                         0
CCS Procedure Description                                            NO PROC
Attending Provider License Number                                 90335341.0
Operating Provider License Number                                        NaN
Total Charges                                                       $9546.85
Name: 0, dtype: object
In [17]:
data.iloc[0]
Out[17]:
Facility ID                                                             37.0
Gender                                                                     F
Length of Stay                                                             8
CCS Diagnosis Code                                                       197
CCS Diagnosis Description            Skin and subcutaneous tissue infections
CCS Procedure Code                                                         0
CCS Procedure Description                                            NO PROC
Attending Provider License Number                                 90335341.0
Operating Provider License Number                                        NaN
Total Charges                                                       $9546.85
Name: 0, dtype: object

.loc: Label indexing¶

You can also index by the label names. Note the rows are being indexed by their named index.

In [18]:
data.loc[:3,'Gender']
Out[18]:
0    F
1    F
2    M
3    F
Name: Gender, dtype: object
In [19]:
data.loc[[0,3,5],['CCS Diagnosis Description','CCS Diagnosis Code']]
Out[19]:
CCS Diagnosis Description CCS Diagnosis Code
0 Skin and subcutaneous tissue infections 197
3 Noninfectious gastroenteritis 154
5 Influenza 123
In [20]:
data[10:20]
Out[20]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
10 37.0 F 3 123 Influenza 0 NO PROC 90335341.0 NaN $4566.15
11 37.0 M 7 122 Pneumonia (except that caused by tuberculosis ... 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $9822.90
12 37.0 M 3 122 Pneumonia (except that caused by tuberculosis ... 0 NO PROC 90335341.0 NaN $5063.05
13 37.0 M 2 155 Other gastrointestinal disorders 0 NO PROC 90335341.0 NaN $3125.75
14 37.0 F 3 122 Pneumonia (except that caused by tuberculosis ... 0 NO PROC 90335341.0 NaN $5055.45
15 37.0 F 6 127 Chronic obstructive pulmonary disease and bron... 0 NO PROC 90335341.0 NaN $9734.05
16 37.0 F 4 127 Chronic obstructive pulmonary disease and bron... 0 NO PROC 90335341.0 NaN $7168.05
17 37.0 F 2 197 Skin and subcutaneous tissue infections 0 NO PROC 90301264.0 NaN $2812.85
18 37.0 M 3 58 Other nutritional; endocrine; and metabolic di... 0 NO PROC 90335341.0 NaN $3377.50
19 37.0 F 2 125 Acute bronchitis 0 NO PROC 90335341.0 NaN $3214.25
In [21]:
data[10:20].loc[[0,3,5]]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[21], line 1
----> 1 data[10:20].loc[[0,3,5]]

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
   1150 axis = self.axis or 0
   1152 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1153 return self._getitem_axis(maybe_callable, axis=axis)

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1382, in _LocIndexer._getitem_axis(self, key, axis)
   1379     if hasattr(key, "ndim") and key.ndim > 1:
   1380         raise ValueError("Cannot index with multidimensional key")
-> 1382     return self._getitem_iterable(key, axis=axis)
   1384 # nested tuple slicing
   1385 if is_nested_tuple(key, labels):

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1322, in _LocIndexer._getitem_iterable(self, key, axis)
   1319 self._validate_key(key, axis)
   1321 # A collection of keys
-> 1322 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1323 return self.obj._reindex_with_indexers(
   1324     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1325 )

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1520, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1517 ax = self.obj._get_axis(axis)
   1518 axis_name = self.obj._get_axis_name(axis)
-> 1520 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1522 return keyarr, indexer

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:6114, in Index._get_indexer_strict(self, key, axis_name)
   6111 else:
   6112     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6114 self._raise_if_missing(keyarr, indexer, axis_name)
   6116 keyarr = self.take(indexer)
   6117 if isinstance(key, Index):
   6118     # GH 42790 - Preserve name from an Index

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:6175, in Index._raise_if_missing(self, key, indexer, axis_name)
   6173     if use_interval_msg:
   6174         key = list(key)
-> 6175     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6177 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6178 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index([0, 3, 5], dtype='int64')] are in the [index]"
In [22]:
data[10:20].iloc[[0,3,5]]
Out[22]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
10 37.0 F 3 123 Influenza 0 NO PROC 90335341.0 NaN $4566.15
13 37.0 M 2 155 Other gastrointestinal disorders 0 NO PROC 90335341.0 NaN $3125.75
15 37.0 F 6 127 Chronic obstructive pulmonary disease and bron... 0 NO PROC 90335341.0 NaN $9734.05

Boolean Indexing¶

Just like numpy we can index by a boolean array or an array of indices.

In [23]:
data[data.Gender == 'M'][:3]
Out[23]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
2 37.0 M 1 50 Diabetes mellitus with complications 202 ELECTROCARDIOGRAM 90335341.0 167816.0 $1609.40
6 37.0 M 5 122 Pneumonia (except that caused by tuberculosis ... 0 NO PROC 90335341.0 NaN $6148.10
7 37.0 M 3 123 Influenza 0 NO PROC 90335341.0 NaN $4204.15
In [24]:
data.iloc[:,[0,3,5]][:3]
Out[24]:
Facility ID CCS Diagnosis Code CCS Procedure Code
0 37.0 197 0
1 37.0 146 0
2 37.0 50 202

Sorting¶

In [25]:
data.sort_values(by="Total Charges").head(3)
Out[25]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
1426998 1439.0 M 1 108 Congestive heart failure; nonhypertensive 0 NO PROC 251948.0 NaN $0.50
780065 989.0 M 2 657 Mood disorders 218 PSYCHO/PSYCHI EVAL/THER 199003.0 154247.0 $1.00
781531 989.0 F 1 661 Substance-related disorders 0 NO PROC 145418.0 NaN $1.00
In [26]:
data.sort_values(by="Total Charges",ascending=False).head(3)
Out[26]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
77722 213.0 F 10 100 Acute myocardial infarction 44 COR ARTERY BYP GRF-CABG 224769.0 224769.0 $99999.65
451015 635.0 F 1 115 Aortic; peripheral; and visceral artery aneurysms 52 AORTIC RESECTION; REPL 232988.0 232988.0 $99999.46
1579047 1456.0 M 12 2 Septicemia (except in labor) 157 AMPUTATE LOWER EXTRMITY 258717.0 265448.0 $99998.45
In [27]:
%%html
<div id="mostexp" style="width: 500px"></div>
<script>
$('head').append('<link rel="stylesheet" href="https://bits.csb.pitt.edu/asker.js/themes/asker.default.css" />');

    var divid = '#mostexp';
	jQuery(divid).asker({
	    id: divid,
	    question: "Is $99999.65 the most expensive record?",
		answers: ["Yes",'Yes','No','No'],
        extra: ["It's at the top of the sorted records, don't make me second-guess myself.","I'm not very familiar with the American health care system so this already seems like an insanely high cost","And I can tell you why not","Because you wouldn't ask if Yes was the answer"],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>

String Methods¶

Can apply standard string functions to all cells. This returns the changed value; it does not mutate in place.

In [28]:
data['Total Charges'] = data['Total Charges'].str.lstrip('$').astype(float)

The above overwrites the previous Total Charges column to be a floating point number instead of a string with a dollar sign.

Correct sorting¶

In [29]:
data.sort_values(by="Total Charges",ascending=False).head(3)
Out[29]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges
967254 1169.0 M 120 + 63 Diseases of white blood cells 64 BONE MARROW TRANSPLANT 198304.0 229870.0 8593455.88
1560003 1456.0 F 120 + 143 Abdominal hernia 86 OTHER HERNIA REPAIR 165181.0 165181.0 6272871.31
957685 1169.0 F 120 + 6 Hepatitis 176 OT ORGAN TRANSPLANTATN 236414.0 183253.0 5745201.42

Creating New Columns¶

In [30]:
# setting errors='coerce' below will result in NaN values for invalid input
data['Length of Stay'] = pd.to_numeric(data['Length of Stay'], errors='coerce')
data['Length of Stay'].fillna(120,inplace=True)
In [31]:
data['Charge per day'] = data['Total Charges']/data['Length of Stay']
In [32]:
data
Out[32]:
Facility ID Gender Length of Stay CCS Diagnosis Code CCS Diagnosis Description CCS Procedure Code CCS Procedure Description Attending Provider License Number Operating Provider License Number Total Charges Charge per day
0 37.0 F 8.0 197 Skin and subcutaneous tissue infections 0 NO PROC 90335341.0 NaN 9546.85 1193.356250
1 37.0 F 3.0 146 Diverticulosis and diverticulitis 0 NO PROC 90335341.0 NaN 11462.75 3820.916667
2 37.0 M 1.0 50 Diabetes mellitus with complications 202 ELECTROCARDIOGRAM 90335341.0 167816.0 1609.40 1609.400000
3 37.0 F 1.0 154 Noninfectious gastroenteritis 202 ELECTROCARDIOGRAM 90335341.0 167816.0 2638.75 2638.750000
4 37.0 F 3.0 124 Acute and chronic tonsillitis 0 NO PROC 90335341.0 NaN 3538.25 1179.416667
... ... ... ... ... ... ... ... ... ... ... ...
2367278 943.0 F 1.0 245 Syncope 0 NO PROC 156102.0 NaN 10074.00 10074.000000
2367279 943.0 M 1.0 149 Biliary tract disease 0 NO PROC 267443.0 NaN 21252.00 21252.000000
2367280 943.0 F 1.0 102 Nonspecific chest pain 0 NO PROC 267443.0 NaN 11673.00 11673.000000
2367281 943.0 F 1.0 660 Alcohol-related disorders 171 SUTURE SKIN/SUBCUT TISS 267443.0 249597.0 16722.00 16722.000000
2367282 943.0 M 1.0 2 Septicemia (except in labor) 0 NO PROC 251080.0 NaN 11637.00 11637.000000

2367283 rows × 11 columns

In [33]:
pd.to_numeric('120 +',errors='coerce')
Out[33]:
nan
In [34]:
pd.to_numeric('120 +',errors='ignore')
Out[34]:
'120 +'
In [35]:
pd.to_numeric('120 +')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File lib.pyx:2368, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "120 +"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[35], line 1
----> 1 pd.to_numeric('120 +')

File /opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/tools/numeric.py:222, in to_numeric(arg, errors, downcast, dtype_backend)
    220 coerce_numeric = errors not in ("ignore", "raise")
    221 try:
--> 222     values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]  # noqa: E501
    223         values,
    224         set(),
    225         coerce_numeric=coerce_numeric,
    226         convert_to_masked_nullable=dtype_backend is not lib.no_default
    227         or isinstance(values_dtype, StringDtype),
    228     )
    229 except (ValueError, TypeError):
    230     if errors == "raise":

File lib.pyx:2410, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "120 +" at position 0

Group by¶

Group records that have the same value for a column

In [36]:
data.groupby('Facility ID')
Out[36]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x2ecbe84c0>

We can then apply an aggregating function to these groups.

In [38]:
#note: make sure mean() is only applied to columns with numerical values, or you will get an error
data.groupby('Facility ID').mean(numeric_only=True).sort_values(by='Total Charges')
Out[38]:
Length of Stay CCS Diagnosis Code CCS Procedure Code Attending Provider License Number Operating Provider License Number Total Charges Charge per day
Facility ID
377.0 2.670732 121.170732 202.439024 4.664694e+07 4.826434e+07 4060.734024 1614.759785
111.0 3.611111 159.777778 0.000000 1.929986e+05 NaN 4492.587778 1473.003327
37.0 3.162791 123.906977 54.779070 8.298531e+07 7.666251e+06 4939.879651 1628.579552
9250.0 2.203456 205.840173 137.955076 1.082429e+07 1.264187e+07 5618.928860 2497.869519
165.0 4.597884 180.206349 181.920635 2.051072e+05 4.744077e+07 6376.906455 1580.321499
... ... ... ... ... ... ... ...
563.0 6.279619 129.367643 78.085210 2.363745e+05 6.581608e+05 89385.986450 20937.033431
1446.0 5.633545 214.859847 157.190406 3.497396e+05 3.955800e+05 124204.521077 36290.348526
1139.0 8.937250 205.031446 87.044577 3.341266e+05 3.847957e+05 131992.454991 17629.688410
1138.0 38.527005 215.072013 213.108020 1.963923e+05 1.966056e+05 175997.805859 4195.405483
1486.0 61.821429 134.039683 194.801587 1.863599e+05 1.921974e+05 195780.272579 2486.885826

215 rows × 7 columns

Example¶

In [39]:
data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').loc[:,['Total Charges']]
Out[39]:
Total Charges
Gender
U 9858.198939
F 36867.585805
M 44419.344760

The group by column has become an indexing column. Need to reset_index to convert index to columns.

In [40]:
data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').reset_index().loc[:,['Gender','Total Charges']]
Out[40]:
Gender Total Charges
0 U 9858.198939
1 F 36867.585805
2 M 44419.344760

Example¶

In [41]:
data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean()
Out[41]:
CCS Procedure Description  Gender
ABDOMINAL PARACENTESIS     F         56996.536195
                           M         58532.559231
ABORTION (TERM OF PREG)    F         26512.622947
ALCO/DRUG REHAB/DETOX      F         18954.578164
                           M         18187.073614
                                         ...     
UPPER GI X-RAY             M         31221.460246
URETERAL CATHETERIZATN     F         36143.329481
                           M         34182.121658
VARI VEIN STRIP;LOW LMB    F         42519.938571
                           M         65763.081875
Name: Total Charges, Length: 446, dtype: float64

unstack Pivot a level of the (necessarily hierarchical) index labels.

In [42]:
gendercosts = data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean().unstack()
gendercosts
Out[42]:
Gender F M U
CCS Procedure Description
ABDOMINAL PARACENTESIS 56996.536195 58532.559231 NaN
ABORTION (TERM OF PREG) 26512.622947 NaN NaN
ALCO/DRUG REHAB/DETOX 18954.578164 18187.073614 NaN
AMPUTATE LOWER EXTRMITY 100926.639344 92821.328596 NaN
AORTIC RESECTION; REPL 144019.629159 114210.484050 NaN
... ... ... ...
UNGROUPABLE 68101.627500 105811.388636 NaN
UP GASTRO ENDOSC/BIOPSY 42625.453678 43340.006917 NaN
UPPER GI X-RAY 28339.654074 31221.460246 NaN
URETERAL CATHETERIZATN 36143.329481 34182.121658 NaN
VARI VEIN STRIP;LOW LMB 42519.938571 65763.081875 NaN

233 rows × 3 columns

In [43]:
genderdiffs = gendercosts.F - gendercosts.M
genderdiffs
Out[43]:
CCS Procedure Description
ABDOMINAL PARACENTESIS     -1536.023036
ABORTION (TERM OF PREG)             NaN
ALCO/DRUG REHAB/DETOX        767.504550
AMPUTATE LOWER EXTRMITY     8105.310748
AORTIC RESECTION; REPL     29809.145110
                               ...     
UNGROUPABLE               -37709.761136
UP GASTRO ENDOSC/BIOPSY     -714.553239
UPPER GI X-RAY             -2881.806172
URETERAL CATHETERIZATN      1961.207823
VARI VEIN STRIP;LOW LMB   -23243.143304
Length: 233, dtype: float64
In [44]:
genderdiffs.dropna(inplace=True)  #remove rows with NaN, modify genderdiffs in place
genderdiffs
Out[44]:
CCS Procedure Description
ABDOMINAL PARACENTESIS     -1536.023036
ALCO/DRUG REHAB/DETOX        767.504550
AMPUTATE LOWER EXTRMITY     8105.310748
AORTIC RESECTION; REPL     29809.145110
APPENDECTOMY                 209.527573
                               ...     
UNGROUPABLE               -37709.761136
UP GASTRO ENDOSC/BIOPSY     -714.553239
UPPER GI X-RAY             -2881.806172
URETERAL CATHETERIZATN      1961.207823
VARI VEIN STRIP;LOW LMB   -23243.143304
Length: 202, dtype: float64
In [45]:
genderdiffs.sort_values().head(5)
Out[45]:
CCS Procedure Description
EXTRA CIRC AUX OPEN HRT   -69453.636115
DX PRCS ON EYE            -63462.345439
SWAN-GANZ CATH MONITOR    -46738.846837
PRCS ON SPLEEN            -46219.301962
CORNEAL TRANSPLANT        -45028.501148
dtype: float64
In [46]:
genderdiffs.sort_values().tail(5)
Out[46]:
CCS Procedure Description
LENS & CATARACT PRCS       35378.240602
OT INTRAOCULAR THER PRC    35907.146236
BONE MARROW TRANSPLANT     37066.859998
DES LES RETINA/CHOROID     52461.621223
CORONARY THROMBOLYSIS      87870.550000
dtype: float64

Combining DataFrames¶

pd.concat concatenates rows (i.e., default axis=0) while merging columns with the same name.

In [47]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)
In [48]:
df1
Out[48]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
In [49]:
df2
Out[49]:
B D F
2 B2 D2 F2
3 B3 D3 F3
6 B6 D6 F6
7 B7 D7 F7
In [50]:
pd.concat([df1,df2])
Out[50]:
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
2 NaN B2 NaN D2 F2
3 NaN B3 NaN D3 F3
6 NaN B6 NaN D6 F6
7 NaN B7 NaN D7 F7

Join¶

You can join two tables on a specific column (or columns). Rows that has the same value (or key) in that column will be combined.

  • inner join - key must exist in both tables
  • outer join - key can exist in either table
  • left join - key must exist in left table
  • right join - key must exist in right table
In [51]:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key": [ "K1", "K2", "K4"],
        "C": ["C1", "C2", "C4"],
        "D": ["D1", "D2", "D4"],
    }
)
In [52]:
left
Out[52]:
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
In [53]:
right
Out[53]:
key C D
0 K1 C1 D1
1 K2 C2 D2
2 K4 C4 D4

Inner Join¶

In [54]:
pd.merge(left,right,on='key',how='inner')
Out[54]:
key A B C D
0 K1 A1 B1 C1 D1
1 K2 A2 B2 C2 D2

Outer Join¶

In [55]:
pd.merge(left,right,on='key',how='outer')
Out[55]:
key A B C D
0 K0 A0 B0 NaN NaN
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 NaN NaN
4 K4 NaN NaN C4 D4

Left Join¶

In [101]:
pd.merge(left,right,on='key',how='left')
Out[101]:
key A B C D
0 K0 A0 B0 NaN NaN
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 NaN NaN

Questions¶

Download: https://bits.csb.pitt.edu/files/Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz

  • How many data records are there?

  • How many coronary bypasses are there?

  • What is the average cost? Standard deviation?

  • What is the most common diagnosis that leads to a coronary bypass?

  • What percent of people with that diagnosis get a coronary bypass?

  • What are the facilities whose average cost for this operation is in the top 10%? Bottom 10%?

  • How correlated is the length of stay to the cost?

  • Is the percentage of people who go to these facilities with the most common diagnosis and receive a coronary bypass significantly different between these two groups?

  • What about knee replacements?

  • How well can a decision tree predict the cost of the operation? What are the most important features?