{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Electronic Medical Records with Pandas\n", "## 09/28/2023\n", "\n", "print view
\n", "notebook\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "https://www.npr.org/sections/health-shots/2015/10/30/452853785/medical-students-crunch-big-data-to-spot-health-trends" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# The Data\n", "\n", "Download this (and, optionally, unzip it):\n", "https://bits.csb.pitt.edu/files/Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz\n", "\n", "Go here:\n", "https://navigator.med.nyu.edu/ace/sparcs/" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "editable": true, "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "!wget https://bits.csb.pitt.edu/files/Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "37\n", "38\n", "37\n", "38\n", "38\n", "40\n", "37\n", "37\n", "38\n", "37\n", "37\n" ] } ], "source": [ "f = open('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv')\n", "i = 0\n", "for line in f:\n", " row = line.split(',')\n", " print(len(row))\n", " i += 1\n", " if i > 10:\n", " break" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "What's wrong? Why (look at the data)? How do we fix it?" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Possible Fix\n", "Use the python [csv](https://docs.python.org/2/library/csv.html) module to read the file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "37\n", "37\n", "37\n", "37\n", "37\n", "37\n", "37\n", "37\n", "37\n", "37\n", "37\n" ] } ], "source": [ "import csv\n", "f = open('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv')\n", "reader = csv.reader(f)\n", "i = 0\n", "for row in reader:\n", " print(len(row))\n", " i += 1\n", " if i > 10:\n", " break" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "*or...*" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# `pandas`\n", "\n", "https://pandas.pydata.org/index.html\n", "\n", "pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It can also read and store compressed data automatically, which is a **big deal**.\n", "\n", "Basically, if you have a big spreadsheet of data with **mixed types**.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/s6/r0hv0w_j5pnbjrpqcsn9sq8m0000gp/T/ipykernel_32823/2813859563.py:3: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.\n", " data = pd.read_csv('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz', usecols=[3,7,10,14,15,16,17,29,30,35])\n" ] } ], "source": [ "#usecols only extracts those columns - much faster\n", "import pandas as pd\n", "data = pd.read_csv('./Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz', usecols=[3,7,10,14,15,16,17,29,30,35])" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "**Note:** This is a *large* dataset so I am preemptively selecting a subset of columns. Try to avoid loading this data more than once as it may take up all your memory and you'll have to restart the python kernel" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
037.0F8197Skin and subcutaneous tissue infections0NO PROC90335341.0NaN$9546.85
137.0F3146Diverticulosis and diverticulitis0NO PROC90335341.0NaN$11462.75
237.0M150Diabetes mellitus with complications202ELECTROCARDIOGRAM90335341.0167816.0$1609.40
337.0F1154Noninfectious gastroenteritis202ELECTROCARDIOGRAM90335341.0167816.0$2638.75
437.0F3124Acute and chronic tonsillitis0NO PROC90335341.0NaN$3538.25
.................................
2367278943.0F1245Syncope0NO PROC156102.0NaN$10074.00
2367279943.0M1149Biliary tract disease0NO PROC267443.0NaN$21252.00
2367280943.0F1102Nonspecific chest pain0NO PROC267443.0NaN$11673.00
2367281943.0F1660Alcohol-related disorders171SUTURE SKIN/SUBCUT TISS267443.0249597.0$16722.00
2367282943.0M12Septicemia (except in labor)0NO PROC251080.0NaN$11637.00
\n", "

2367283 rows × 10 columns

\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "0 37.0 F 8 197 \n", "1 37.0 F 3 146 \n", "2 37.0 M 1 50 \n", "3 37.0 F 1 154 \n", "4 37.0 F 3 124 \n", "... ... ... ... ... \n", "2367278 943.0 F 1 245 \n", "2367279 943.0 M 1 149 \n", "2367280 943.0 F 1 102 \n", "2367281 943.0 F 1 660 \n", "2367282 943.0 M 1 2 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "0 Skin and subcutaneous tissue infections 0 \n", "1 Diverticulosis and diverticulitis 0 \n", "2 Diabetes mellitus with complications 202 \n", "3 Noninfectious gastroenteritis 202 \n", "4 Acute and chronic tonsillitis 0 \n", "... ... ... \n", "2367278 Syncope 0 \n", "2367279 Biliary tract disease 0 \n", "2367280 Nonspecific chest pain 0 \n", "2367281 Alcohol-related disorders 171 \n", "2367282 Septicemia (except in labor) 0 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "0 NO PROC 90335341.0 \n", "1 NO PROC 90335341.0 \n", "2 ELECTROCARDIOGRAM 90335341.0 \n", "3 ELECTROCARDIOGRAM 90335341.0 \n", "4 NO PROC 90335341.0 \n", "... ... ... \n", "2367278 NO PROC 156102.0 \n", "2367279 NO PROC 267443.0 \n", "2367280 NO PROC 267443.0 \n", "2367281 SUTURE SKIN/SUBCUT TISS 267443.0 \n", "2367282 NO PROC 251080.0 \n", "\n", " Operating Provider License Number Total Charges \n", "0 NaN $9546.85 \n", "1 NaN $11462.75 \n", "2 167816.0 $1609.40 \n", "3 167816.0 $2638.75 \n", "4 NaN $3538.25 \n", "... ... ... \n", "2367278 NaN $10074.00 \n", "2367279 NaN $21252.00 \n", "2367280 NaN $11673.00 \n", "2367281 249597.0 $16722.00 \n", "2367282 NaN $11637.00 \n", "\n", "[2367283 rows x 10 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "editable": true, "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[37.0, 'F', '8', ..., 90335341.0, nan, '$9546.85'],\n", " [37.0, 'F', '3', ..., 90335341.0, nan, '$11462.75'],\n", " [37.0, 'M', '1', ..., 90335341.0, 167816.0, '$1609.40'],\n", " ...,\n", " [943.0, 'F', '1', ..., 267443.0, nan, '$11673.00'],\n", " [943.0, 'F', '1', ..., 267443.0, 249597.0, '$16722.00'],\n", " [943.0, 'M', '1', ..., 251080.0, nan, '$11637.00']], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.to_numpy()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Data Access\n", "\n", "Columns (and rows) have names that you can use to access them." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Index(['Facility ID', 'Gender', 'Length of Stay', 'CCS Diagnosis Code',\n", " 'CCS Diagnosis Description', 'CCS Procedure Code',\n", " 'CCS Procedure Description', 'Attending Provider License Number',\n", " 'Operating Provider License Number', 'Total Charges'],\n", " dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 F\n", "1 F\n", "2 M\n", "3 F\n", "4 F\n", " ..\n", "2367278 F\n", "2367279 M\n", "2367280 F\n", "2367281 F\n", "2367282 M\n", "Name: Gender, Length: 2367283, dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.Gender #get's gender column" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Data Access" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 197\n", "1 146\n", "2 50\n", "Name: CCS Diagnosis Code, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[\"CCS Diagnosis Code\"][:3] # can't use dot syntax if column name has spaces" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "`[]` slices by rows, but *indexes* by column name - must provide range or it interprets the index as a column label." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
037.0F8197Skin and subcutaneous tissue infections0NO PROC90335341.0NaN$9546.85
\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "0 37.0 F 8 197 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "0 Skin and subcutaneous tissue infections 0 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "0 NO PROC 90335341.0 \n", "\n", " Operating Provider License Number Total Charges \n", "0 NaN $9546.85 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[:1]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "ename": "KeyError", "evalue": "0", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:3790\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3789\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m-> 3790\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_engine\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcasted_key\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 3791\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n", "File \u001b[0;32mindex.pyx:152\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "File \u001b[0;32mindex.pyx:181\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "File \u001b[0;32mpandas/_libs/hashtable_class_helper.pxi:7080\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "File \u001b[0;32mpandas/_libs/hashtable_class_helper.pxi:7088\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 0", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[12], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mdata\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m \n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/frame.py:3896\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3894\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mnlevels \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m 3895\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_multilevel(key)\n\u001b[0;32m-> 3896\u001b[0m indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcolumns\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 3897\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n\u001b[1;32m 3898\u001b[0m indexer \u001b[38;5;241m=\u001b[39m [indexer]\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:3797\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3792\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(casted_key, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;129;01mor\u001b[39;00m (\n\u001b[1;32m 3793\u001b[0m \u001b[38;5;28misinstance\u001b[39m(casted_key, abc\u001b[38;5;241m.\u001b[39mIterable)\n\u001b[1;32m 3794\u001b[0m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28many\u001b[39m(\u001b[38;5;28misinstance\u001b[39m(x, \u001b[38;5;28mslice\u001b[39m) \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m casted_key)\n\u001b[1;32m 3795\u001b[0m ):\n\u001b[1;32m 3796\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m InvalidIndexError(key)\n\u001b[0;32m-> 3797\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[1;32m 3798\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[1;32m 3799\u001b[0m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[1;32m 3800\u001b[0m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[1;32m 3801\u001b[0m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[1;32m 3802\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_indexing_error(key)\n", "\u001b[0;31mKeyError\u001b[0m: 0" ] } ], "source": [ "data[0] " ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# `iloc`: Position indexing\n", "\n", "If you want to reference a pandas data frame with position based indexing, use .iloc - work's just like `numpy`" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'F'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[3,1]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Facility ID 37.0\n", "Gender F\n", "Length of Stay 8\n", "CCS Diagnosis Code 197\n", "CCS Diagnosis Description Skin and subcutaneous tissue infections\n", "CCS Procedure Code 0\n", "CCS Procedure Description NO PROC\n", "Attending Provider License Number 90335341.0\n", "Operating Provider License Number NaN\n", "Total Charges $9546.85\n", "Name: 0, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[0,:]" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Pandas uses NaN to indicate missing data" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Facility ID 37.0\n", "Gender F\n", "Length of Stay 8\n", "CCS Diagnosis Code 197\n", "CCS Diagnosis Description Skin and subcutaneous tissue infections\n", "CCS Procedure Code 0\n", "CCS Procedure Description NO PROC\n", "Attending Provider License Number 90335341.0\n", "Operating Provider License Number NaN\n", "Total Charges $9546.85\n", "Name: 0, dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[0]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "editable": true, "slideshow": { "slide_type": "subslide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Facility ID 37.0\n", "Gender F\n", "Length of Stay 8\n", "CCS Diagnosis Code 197\n", "CCS Diagnosis Description Skin and subcutaneous tissue infections\n", "CCS Procedure Code 0\n", "CCS Procedure Description NO PROC\n", "Attending Provider License Number 90335341.0\n", "Operating Provider License Number NaN\n", "Total Charges $9546.85\n", "Name: 0, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[0]" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# `.loc`: Label indexing\n", "\n", "You can also index by the label names. Note the rows are being indexed by their named index." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 F\n", "1 F\n", "2 M\n", "3 F\n", "Name: Gender, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[:3,'Gender']" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CCS Diagnosis DescriptionCCS Diagnosis Code
0Skin and subcutaneous tissue infections197
3Noninfectious gastroenteritis154
5Influenza123
\n", "
" ], "text/plain": [ " CCS Diagnosis Description CCS Diagnosis Code\n", "0 Skin and subcutaneous tissue infections 197\n", "3 Noninfectious gastroenteritis 154\n", "5 Influenza 123" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[[0,3,5],['CCS Diagnosis Description','CCS Diagnosis Code']]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
1037.0F3123Influenza0NO PROC90335341.0NaN$4566.15
1137.0M7122Pneumonia (except that caused by tuberculosis ...202ELECTROCARDIOGRAM90335341.0167816.0$9822.90
1237.0M3122Pneumonia (except that caused by tuberculosis ...0NO PROC90335341.0NaN$5063.05
1337.0M2155Other gastrointestinal disorders0NO PROC90335341.0NaN$3125.75
1437.0F3122Pneumonia (except that caused by tuberculosis ...0NO PROC90335341.0NaN$5055.45
1537.0F6127Chronic obstructive pulmonary disease and bron...0NO PROC90335341.0NaN$9734.05
1637.0F4127Chronic obstructive pulmonary disease and bron...0NO PROC90335341.0NaN$7168.05
1737.0F2197Skin and subcutaneous tissue infections0NO PROC90301264.0NaN$2812.85
1837.0M358Other nutritional; endocrine; and metabolic di...0NO PROC90335341.0NaN$3377.50
1937.0F2125Acute bronchitis0NO PROC90335341.0NaN$3214.25
\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "10 37.0 F 3 123 \n", "11 37.0 M 7 122 \n", "12 37.0 M 3 122 \n", "13 37.0 M 2 155 \n", "14 37.0 F 3 122 \n", "15 37.0 F 6 127 \n", "16 37.0 F 4 127 \n", "17 37.0 F 2 197 \n", "18 37.0 M 3 58 \n", "19 37.0 F 2 125 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "10 Influenza 0 \n", "11 Pneumonia (except that caused by tuberculosis ... 202 \n", "12 Pneumonia (except that caused by tuberculosis ... 0 \n", "13 Other gastrointestinal disorders 0 \n", "14 Pneumonia (except that caused by tuberculosis ... 0 \n", "15 Chronic obstructive pulmonary disease and bron... 0 \n", "16 Chronic obstructive pulmonary disease and bron... 0 \n", "17 Skin and subcutaneous tissue infections 0 \n", "18 Other nutritional; endocrine; and metabolic di... 0 \n", "19 Acute bronchitis 0 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "10 NO PROC 90335341.0 \n", "11 ELECTROCARDIOGRAM 90335341.0 \n", "12 NO PROC 90335341.0 \n", "13 NO PROC 90335341.0 \n", "14 NO PROC 90335341.0 \n", "15 NO PROC 90335341.0 \n", "16 NO PROC 90335341.0 \n", "17 NO PROC 90301264.0 \n", "18 NO PROC 90335341.0 \n", "19 NO PROC 90335341.0 \n", "\n", " Operating Provider License Number Total Charges \n", "10 NaN $4566.15 \n", "11 167816.0 $9822.90 \n", "12 NaN $5063.05 \n", "13 NaN $3125.75 \n", "14 NaN $5055.45 \n", "15 NaN $9734.05 \n", "16 NaN $7168.05 \n", "17 NaN $2812.85 \n", "18 NaN $3377.50 \n", "19 NaN $3214.25 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[10:20]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "ename": "KeyError", "evalue": "\"None of [Index([0, 3, 5], dtype='int64')] are in the [index]\"", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[21], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mdata\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m10\u001b[39;49m\u001b[43m:\u001b[49m\u001b[38;5;241;43m20\u001b[39;49m\u001b[43m]\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mloc\u001b[49m\u001b[43m[\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[38;5;241;43m3\u001b[39;49m\u001b[43m,\u001b[49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1153\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1150\u001b[0m axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxis \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;241m0\u001b[39m\n\u001b[1;32m 1152\u001b[0m maybe_callable \u001b[38;5;241m=\u001b[39m com\u001b[38;5;241m.\u001b[39mapply_if_callable(key, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj)\n\u001b[0;32m-> 1153\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_axis\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmaybe_callable\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1382\u001b[0m, in \u001b[0;36m_LocIndexer._getitem_axis\u001b[0;34m(self, key, axis)\u001b[0m\n\u001b[1;32m 1379\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(key, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mndim\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;129;01mand\u001b[39;00m key\u001b[38;5;241m.\u001b[39mndim \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[1;32m 1380\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot index with multidimensional key\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m-> 1382\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_iterable\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1384\u001b[0m \u001b[38;5;66;03m# nested tuple slicing\u001b[39;00m\n\u001b[1;32m 1385\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_nested_tuple(key, labels):\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1322\u001b[0m, in \u001b[0;36m_LocIndexer._getitem_iterable\u001b[0;34m(self, key, axis)\u001b[0m\n\u001b[1;32m 1319\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_key(key, axis)\n\u001b[1;32m 1321\u001b[0m \u001b[38;5;66;03m# A collection of keys\u001b[39;00m\n\u001b[0;32m-> 1322\u001b[0m keyarr, indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_listlike_indexer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1323\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_reindex_with_indexers(\n\u001b[1;32m 1324\u001b[0m {axis: [keyarr, indexer]}, copy\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m, allow_dups\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[1;32m 1325\u001b[0m )\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexing.py:1520\u001b[0m, in \u001b[0;36m_LocIndexer._get_listlike_indexer\u001b[0;34m(self, key, axis)\u001b[0m\n\u001b[1;32m 1517\u001b[0m ax \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_get_axis(axis)\n\u001b[1;32m 1518\u001b[0m axis_name \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_get_axis_name(axis)\n\u001b[0;32m-> 1520\u001b[0m keyarr, indexer \u001b[38;5;241m=\u001b[39m \u001b[43max\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_indexer_strict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis_name\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1522\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m keyarr, indexer\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:6114\u001b[0m, in \u001b[0;36mIndex._get_indexer_strict\u001b[0;34m(self, key, axis_name)\u001b[0m\n\u001b[1;32m 6111\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 6112\u001b[0m keyarr, indexer, new_indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_reindex_non_unique(keyarr)\n\u001b[0;32m-> 6114\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_raise_if_missing\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkeyarr\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindexer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis_name\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 6116\u001b[0m keyarr \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtake(indexer)\n\u001b[1;32m 6117\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(key, Index):\n\u001b[1;32m 6118\u001b[0m \u001b[38;5;66;03m# GH 42790 - Preserve name from an Index\u001b[39;00m\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/indexes/base.py:6175\u001b[0m, in \u001b[0;36mIndex._raise_if_missing\u001b[0;34m(self, key, indexer, axis_name)\u001b[0m\n\u001b[1;32m 6173\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m use_interval_msg:\n\u001b[1;32m 6174\u001b[0m key \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(key)\n\u001b[0;32m-> 6175\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNone of [\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mkey\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m] are in the [\u001b[39m\u001b[38;5;132;01m{\u001b[39;00maxis_name\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m]\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 6177\u001b[0m not_found \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(ensure_index(key)[missing_mask\u001b[38;5;241m.\u001b[39mnonzero()[\u001b[38;5;241m0\u001b[39m]]\u001b[38;5;241m.\u001b[39munique())\n\u001b[1;32m 6178\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mnot_found\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m not in index\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", "\u001b[0;31mKeyError\u001b[0m: \"None of [Index([0, 3, 5], dtype='int64')] are in the [index]\"" ] } ], "source": [ "data[10:20].loc[[0,3,5]]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
1037.0F3123Influenza0NO PROC90335341.0NaN$4566.15
1337.0M2155Other gastrointestinal disorders0NO PROC90335341.0NaN$3125.75
1537.0F6127Chronic obstructive pulmonary disease and bron...0NO PROC90335341.0NaN$9734.05
\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "10 37.0 F 3 123 \n", "13 37.0 M 2 155 \n", "15 37.0 F 6 127 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "10 Influenza 0 \n", "13 Other gastrointestinal disorders 0 \n", "15 Chronic obstructive pulmonary disease and bron... 0 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "10 NO PROC 90335341.0 \n", "13 NO PROC 90335341.0 \n", "15 NO PROC 90335341.0 \n", "\n", " Operating Provider License Number Total Charges \n", "10 NaN $4566.15 \n", "13 NaN $3125.75 \n", "15 NaN $9734.05 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[10:20].iloc[[0,3,5]]" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Boolean Indexing\n", "\n", "Just like `numpy` we can index by a boolean array or an array of indices." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
237.0M150Diabetes mellitus with complications202ELECTROCARDIOGRAM90335341.0167816.0$1609.40
637.0M5122Pneumonia (except that caused by tuberculosis ...0NO PROC90335341.0NaN$6148.10
737.0M3123Influenza0NO PROC90335341.0NaN$4204.15
\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "2 37.0 M 1 50 \n", "6 37.0 M 5 122 \n", "7 37.0 M 3 123 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "2 Diabetes mellitus with complications 202 \n", "6 Pneumonia (except that caused by tuberculosis ... 0 \n", "7 Influenza 0 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "2 ELECTROCARDIOGRAM 90335341.0 \n", "6 NO PROC 90335341.0 \n", "7 NO PROC 90335341.0 \n", "\n", " Operating Provider License Number Total Charges \n", "2 167816.0 $1609.40 \n", "6 NaN $6148.10 \n", "7 NaN $4204.15 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data.Gender == 'M'][:3]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDCCS Diagnosis CodeCCS Procedure Code
037.01970
137.01460
237.050202
\n", "
" ], "text/plain": [ " Facility ID CCS Diagnosis Code CCS Procedure Code\n", "0 37.0 197 0\n", "1 37.0 146 0\n", "2 37.0 50 202" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[:,[0,3,5]][:3]" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Sorting" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
14269981439.0M1108Congestive heart failure; nonhypertensive0NO PROC251948.0NaN$0.50
780065989.0M2657Mood disorders218PSYCHO/PSYCHI EVAL/THER199003.0154247.0$1.00
781531989.0F1661Substance-related disorders0NO PROC145418.0NaN$1.00
\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "1426998 1439.0 M 1 108 \n", "780065 989.0 M 2 657 \n", "781531 989.0 F 1 661 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "1426998 Congestive heart failure; nonhypertensive 0 \n", "780065 Mood disorders 218 \n", "781531 Substance-related disorders 0 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "1426998 NO PROC 251948.0 \n", "780065 PSYCHO/PSYCHI EVAL/THER 199003.0 \n", "781531 NO PROC 145418.0 \n", "\n", " Operating Provider License Number Total Charges \n", "1426998 NaN $0.50 \n", "780065 154247.0 $1.00 \n", "781531 NaN $1.00 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by=\"Total Charges\").head(3)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
77722213.0F10100Acute myocardial infarction44COR ARTERY BYP GRF-CABG224769.0224769.0$99999.65
451015635.0F1115Aortic; peripheral; and visceral artery aneurysms52AORTIC RESECTION; REPL232988.0232988.0$99999.46
15790471456.0M122Septicemia (except in labor)157AMPUTATE LOWER EXTRMITY258717.0265448.0$99998.45
\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "77722 213.0 F 10 100 \n", "451015 635.0 F 1 115 \n", "1579047 1456.0 M 12 2 \n", "\n", " CCS Diagnosis Description \\\n", "77722 Acute myocardial infarction \n", "451015 Aortic; peripheral; and visceral artery aneurysms \n", "1579047 Septicemia (except in labor) \n", "\n", " CCS Procedure Code CCS Procedure Description \\\n", "77722 44 COR ARTERY BYP GRF-CABG \n", "451015 52 AORTIC RESECTION; REPL \n", "1579047 157 AMPUTATE LOWER EXTRMITY \n", "\n", " Attending Provider License Number Operating Provider License Number \\\n", "77722 224769.0 224769.0 \n", "451015 232988.0 232988.0 \n", "1579047 258717.0 265448.0 \n", "\n", " Total Charges \n", "77722 $99999.65 \n", "451015 $99999.46 \n", "1579047 $99998.45 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by=\"Total Charges\",ascending=False).head(3)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# String Methods\n", "\n", "Can apply standard string functions to all cells. This returns the changed value; it does not mutate in place." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "data['Total Charges'] = data['Total Charges'].str.lstrip('$').astype(float)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The above *overwrites* the previous Total Charges column to be a floating point number instead of a string with a dollar sign." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Correct sorting" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal Charges
9672541169.0M120 +63Diseases of white blood cells64BONE MARROW TRANSPLANT198304.0229870.08593455.88
15600031456.0F120 +143Abdominal hernia86OTHER HERNIA REPAIR165181.0165181.06272871.31
9576851169.0F120 +6Hepatitis176OT ORGAN TRANSPLANTATN236414.0183253.05745201.42
\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "967254 1169.0 M 120 + 63 \n", "1560003 1456.0 F 120 + 143 \n", "957685 1169.0 F 120 + 6 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "967254 Diseases of white blood cells 64 \n", "1560003 Abdominal hernia 86 \n", "957685 Hepatitis 176 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "967254 BONE MARROW TRANSPLANT 198304.0 \n", "1560003 OTHER HERNIA REPAIR 165181.0 \n", "957685 OT ORGAN TRANSPLANTATN 236414.0 \n", "\n", " Operating Provider License Number Total Charges \n", "967254 229870.0 8593455.88 \n", "1560003 165181.0 6272871.31 \n", "957685 183253.0 5745201.42 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values(by=\"Total Charges\",ascending=False).head(3)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Creating New Columns" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# setting errors='coerce' below will result in NaN values for invalid input\n", "data['Length of Stay'] = pd.to_numeric(data['Length of Stay'], errors='coerce')\n", "data['Length of Stay'].fillna(120,inplace=True)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "data['Charge per day'] = data['Total Charges']/data['Length of Stay']" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Facility IDGenderLength of StayCCS Diagnosis CodeCCS Diagnosis DescriptionCCS Procedure CodeCCS Procedure DescriptionAttending Provider License NumberOperating Provider License NumberTotal ChargesCharge per day
037.0F8.0197Skin and subcutaneous tissue infections0NO PROC90335341.0NaN9546.851193.356250
137.0F3.0146Diverticulosis and diverticulitis0NO PROC90335341.0NaN11462.753820.916667
237.0M1.050Diabetes mellitus with complications202ELECTROCARDIOGRAM90335341.0167816.01609.401609.400000
337.0F1.0154Noninfectious gastroenteritis202ELECTROCARDIOGRAM90335341.0167816.02638.752638.750000
437.0F3.0124Acute and chronic tonsillitis0NO PROC90335341.0NaN3538.251179.416667
....................................
2367278943.0F1.0245Syncope0NO PROC156102.0NaN10074.0010074.000000
2367279943.0M1.0149Biliary tract disease0NO PROC267443.0NaN21252.0021252.000000
2367280943.0F1.0102Nonspecific chest pain0NO PROC267443.0NaN11673.0011673.000000
2367281943.0F1.0660Alcohol-related disorders171SUTURE SKIN/SUBCUT TISS267443.0249597.016722.0016722.000000
2367282943.0M1.02Septicemia (except in labor)0NO PROC251080.0NaN11637.0011637.000000
\n", "

2367283 rows × 11 columns

\n", "
" ], "text/plain": [ " Facility ID Gender Length of Stay CCS Diagnosis Code \\\n", "0 37.0 F 8.0 197 \n", "1 37.0 F 3.0 146 \n", "2 37.0 M 1.0 50 \n", "3 37.0 F 1.0 154 \n", "4 37.0 F 3.0 124 \n", "... ... ... ... ... \n", "2367278 943.0 F 1.0 245 \n", "2367279 943.0 M 1.0 149 \n", "2367280 943.0 F 1.0 102 \n", "2367281 943.0 F 1.0 660 \n", "2367282 943.0 M 1.0 2 \n", "\n", " CCS Diagnosis Description CCS Procedure Code \\\n", "0 Skin and subcutaneous tissue infections 0 \n", "1 Diverticulosis and diverticulitis 0 \n", "2 Diabetes mellitus with complications 202 \n", "3 Noninfectious gastroenteritis 202 \n", "4 Acute and chronic tonsillitis 0 \n", "... ... ... \n", "2367278 Syncope 0 \n", "2367279 Biliary tract disease 0 \n", "2367280 Nonspecific chest pain 0 \n", "2367281 Alcohol-related disorders 171 \n", "2367282 Septicemia (except in labor) 0 \n", "\n", " CCS Procedure Description Attending Provider License Number \\\n", "0 NO PROC 90335341.0 \n", "1 NO PROC 90335341.0 \n", "2 ELECTROCARDIOGRAM 90335341.0 \n", "3 ELECTROCARDIOGRAM 90335341.0 \n", "4 NO PROC 90335341.0 \n", "... ... ... \n", "2367278 NO PROC 156102.0 \n", "2367279 NO PROC 267443.0 \n", "2367280 NO PROC 267443.0 \n", "2367281 SUTURE SKIN/SUBCUT TISS 267443.0 \n", "2367282 NO PROC 251080.0 \n", "\n", " Operating Provider License Number Total Charges Charge per day \n", "0 NaN 9546.85 1193.356250 \n", "1 NaN 11462.75 3820.916667 \n", "2 167816.0 1609.40 1609.400000 \n", "3 167816.0 2638.75 2638.750000 \n", "4 NaN 3538.25 1179.416667 \n", "... ... ... ... \n", "2367278 NaN 10074.00 10074.000000 \n", "2367279 NaN 21252.00 21252.000000 \n", "2367280 NaN 11673.00 11673.000000 \n", "2367281 249597.0 16722.00 16722.000000 \n", "2367282 NaN 11637.00 11637.000000 \n", "\n", "[2367283 rows x 11 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_numeric('120 +',errors='coerce')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'120 +'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.to_numeric('120 +',errors='ignore')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "ename": "ValueError", "evalue": "Unable to parse string \"120 +\" at position 0", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "File \u001b[0;32mlib.pyx:2368\u001b[0m, in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: Unable to parse string \"120 +\"", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[35], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mpd\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_numeric\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43m120 +\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/pandas/core/tools/numeric.py:222\u001b[0m, in \u001b[0;36mto_numeric\u001b[0;34m(arg, errors, downcast, dtype_backend)\u001b[0m\n\u001b[1;32m 220\u001b[0m coerce_numeric \u001b[38;5;241m=\u001b[39m errors \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mignore\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mraise\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 221\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[0;32m--> 222\u001b[0m values, new_mask \u001b[38;5;241m=\u001b[39m \u001b[43mlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mmaybe_convert_numeric\u001b[49m\u001b[43m(\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# type: ignore[call-overload] # noqa: E501\u001b[39;49;00m\n\u001b[1;32m 223\u001b[0m \u001b[43m \u001b[49m\u001b[43mvalues\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 224\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mset\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 225\u001b[0m \u001b[43m \u001b[49m\u001b[43mcoerce_numeric\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcoerce_numeric\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 226\u001b[0m \u001b[43m \u001b[49m\u001b[43mconvert_to_masked_nullable\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype_backend\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mlib\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mno_default\u001b[49m\n\u001b[1;32m 227\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;129;43;01mor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43misinstance\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mvalues_dtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mStringDtype\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 228\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 229\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m (\u001b[38;5;167;01mValueError\u001b[39;00m, \u001b[38;5;167;01mTypeError\u001b[39;00m):\n\u001b[1;32m 230\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m errors \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mraise\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n", "File \u001b[0;32mlib.pyx:2410\u001b[0m, in \u001b[0;36mpandas._libs.lib.maybe_convert_numeric\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: Unable to parse string \"120 +\" at position 0" ] } ], "source": [ "pd.to_numeric('120 +')" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Group by\n", "\n", "Group records that have the same value for a column" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('Facility ID')" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "We can then apply an aggregating function to these groups." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Length of StayCCS Diagnosis CodeCCS Procedure CodeAttending Provider License NumberOperating Provider License NumberTotal ChargesCharge per day
Facility ID
377.02.670732121.170732202.4390244.664694e+074.826434e+074060.7340241614.759785
111.03.611111159.7777780.0000001.929986e+05NaN4492.5877781473.003327
37.03.162791123.90697754.7790708.298531e+077.666251e+064939.8796511628.579552
9250.02.203456205.840173137.9550761.082429e+071.264187e+075618.9288602497.869519
165.04.597884180.206349181.9206352.051072e+054.744077e+076376.9064551580.321499
........................
563.06.279619129.36764378.0852102.363745e+056.581608e+0589385.98645020937.033431
1446.05.633545214.859847157.1904063.497396e+053.955800e+05124204.52107736290.348526
1139.08.937250205.03144687.0445773.341266e+053.847957e+05131992.45499117629.688410
1138.038.527005215.072013213.1080201.963923e+051.966056e+05175997.8058594195.405483
1486.061.821429134.039683194.8015871.863599e+051.921974e+05195780.2725792486.885826
\n", "

215 rows × 7 columns

\n", "
" ], "text/plain": [ " Length of Stay CCS Diagnosis Code CCS Procedure Code \\\n", "Facility ID \n", "377.0 2.670732 121.170732 202.439024 \n", "111.0 3.611111 159.777778 0.000000 \n", "37.0 3.162791 123.906977 54.779070 \n", "9250.0 2.203456 205.840173 137.955076 \n", "165.0 4.597884 180.206349 181.920635 \n", "... ... ... ... \n", "563.0 6.279619 129.367643 78.085210 \n", "1446.0 5.633545 214.859847 157.190406 \n", "1139.0 8.937250 205.031446 87.044577 \n", "1138.0 38.527005 215.072013 213.108020 \n", "1486.0 61.821429 134.039683 194.801587 \n", "\n", " Attending Provider License Number \\\n", "Facility ID \n", "377.0 4.664694e+07 \n", "111.0 1.929986e+05 \n", "37.0 8.298531e+07 \n", "9250.0 1.082429e+07 \n", "165.0 2.051072e+05 \n", "... ... \n", "563.0 2.363745e+05 \n", "1446.0 3.497396e+05 \n", "1139.0 3.341266e+05 \n", "1138.0 1.963923e+05 \n", "1486.0 1.863599e+05 \n", "\n", " Operating Provider License Number Total Charges Charge per day \n", "Facility ID \n", "377.0 4.826434e+07 4060.734024 1614.759785 \n", "111.0 NaN 4492.587778 1473.003327 \n", "37.0 7.666251e+06 4939.879651 1628.579552 \n", "9250.0 1.264187e+07 5618.928860 2497.869519 \n", "165.0 4.744077e+07 6376.906455 1580.321499 \n", "... ... ... ... \n", "563.0 6.581608e+05 89385.986450 20937.033431 \n", "1446.0 3.955800e+05 124204.521077 36290.348526 \n", "1139.0 3.847957e+05 131992.454991 17629.688410 \n", "1138.0 1.966056e+05 175997.805859 4195.405483 \n", "1486.0 1.921974e+05 195780.272579 2486.885826 \n", "\n", "[215 rows x 7 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#note: make sure mean() is only applied to columns with numerical values, or you will get an error\n", "data.groupby('Facility ID').mean(numeric_only=True).sort_values(by='Total Charges')" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Example" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Total Charges
Gender
U9858.198939
F36867.585805
M44419.344760
\n", "
" ], "text/plain": [ " Total Charges\n", "Gender \n", "U 9858.198939\n", "F 36867.585805\n", "M 44419.344760" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').loc[:,['Total Charges']]" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The group by column has become an indexing column. Need to `reset_index` to convert *index* to *columns*." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GenderTotal Charges
0U9858.198939
1F36867.585805
2M44419.344760
\n", "
" ], "text/plain": [ " Gender Total Charges\n", "0 U 9858.198939\n", "1 F 36867.585805\n", "2 M 44419.344760" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby('Gender').mean(numeric_only=True).sort_values(by='Total Charges').reset_index().loc[:,['Gender','Total Charges']]" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Example" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "CCS Procedure Description Gender\n", "ABDOMINAL PARACENTESIS F 56996.536195\n", " M 58532.559231\n", "ABORTION (TERM OF PREG) F 26512.622947\n", "ALCO/DRUG REHAB/DETOX F 18954.578164\n", " M 18187.073614\n", " ... \n", "UPPER GI X-RAY M 31221.460246\n", "URETERAL CATHETERIZATN F 36143.329481\n", " M 34182.121658\n", "VARI VEIN STRIP;LOW LMB F 42519.938571\n", " M 65763.081875\n", "Name: Total Charges, Length: 446, dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean()" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "**unstack** Pivot a level of the (necessarily hierarchical) index labels." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "editable": true, "slideshow": { "slide_type": "-" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GenderFMU
CCS Procedure Description
ABDOMINAL PARACENTESIS56996.53619558532.559231NaN
ABORTION (TERM OF PREG)26512.622947NaNNaN
ALCO/DRUG REHAB/DETOX18954.57816418187.073614NaN
AMPUTATE LOWER EXTRMITY100926.63934492821.328596NaN
AORTIC RESECTION; REPL144019.629159114210.484050NaN
............
UNGROUPABLE68101.627500105811.388636NaN
UP GASTRO ENDOSC/BIOPSY42625.45367843340.006917NaN
UPPER GI X-RAY28339.65407431221.460246NaN
URETERAL CATHETERIZATN36143.32948134182.121658NaN
VARI VEIN STRIP;LOW LMB42519.93857165763.081875NaN
\n", "

233 rows × 3 columns

\n", "
" ], "text/plain": [ "Gender F M U\n", "CCS Procedure Description \n", "ABDOMINAL PARACENTESIS 56996.536195 58532.559231 NaN\n", "ABORTION (TERM OF PREG) 26512.622947 NaN NaN\n", "ALCO/DRUG REHAB/DETOX 18954.578164 18187.073614 NaN\n", "AMPUTATE LOWER EXTRMITY 100926.639344 92821.328596 NaN\n", "AORTIC RESECTION; REPL 144019.629159 114210.484050 NaN\n", "... ... ... ..\n", "UNGROUPABLE 68101.627500 105811.388636 NaN\n", "UP GASTRO ENDOSC/BIOPSY 42625.453678 43340.006917 NaN\n", "UPPER GI X-RAY 28339.654074 31221.460246 NaN\n", "URETERAL CATHETERIZATN 36143.329481 34182.121658 NaN\n", "VARI VEIN STRIP;LOW LMB 42519.938571 65763.081875 NaN\n", "\n", "[233 rows x 3 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gendercosts = data.groupby(['CCS Procedure Description','Gender'])['Total Charges'].mean().unstack()\n", "gendercosts" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "CCS Procedure Description\n", "ABDOMINAL PARACENTESIS -1536.023036\n", "ABORTION (TERM OF PREG) NaN\n", "ALCO/DRUG REHAB/DETOX 767.504550\n", "AMPUTATE LOWER EXTRMITY 8105.310748\n", "AORTIC RESECTION; REPL 29809.145110\n", " ... \n", "UNGROUPABLE -37709.761136\n", "UP GASTRO ENDOSC/BIOPSY -714.553239\n", "UPPER GI X-RAY -2881.806172\n", "URETERAL CATHETERIZATN 1961.207823\n", "VARI VEIN STRIP;LOW LMB -23243.143304\n", "Length: 233, dtype: float64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genderdiffs = gendercosts.F - gendercosts.M\n", "genderdiffs" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "CCS Procedure Description\n", "ABDOMINAL PARACENTESIS -1536.023036\n", "ALCO/DRUG REHAB/DETOX 767.504550\n", "AMPUTATE LOWER EXTRMITY 8105.310748\n", "AORTIC RESECTION; REPL 29809.145110\n", "APPENDECTOMY 209.527573\n", " ... \n", "UNGROUPABLE -37709.761136\n", "UP GASTRO ENDOSC/BIOPSY -714.553239\n", "UPPER GI X-RAY -2881.806172\n", "URETERAL CATHETERIZATN 1961.207823\n", "VARI VEIN STRIP;LOW LMB -23243.143304\n", "Length: 202, dtype: float64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genderdiffs.dropna(inplace=True) #remove rows with NaN, modify genderdiffs in place\n", "genderdiffs" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "CCS Procedure Description\n", "EXTRA CIRC AUX OPEN HRT -69453.636115\n", "DX PRCS ON EYE -63462.345439\n", "SWAN-GANZ CATH MONITOR -46738.846837\n", "PRCS ON SPLEEN -46219.301962\n", "CORNEAL TRANSPLANT -45028.501148\n", "dtype: float64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genderdiffs.sort_values().head(5)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "CCS Procedure Description\n", "LENS & CATARACT PRCS 35378.240602\n", "OT INTRAOCULAR THER PRC 35907.146236\n", "BONE MARROW TRANSPLANT 37066.859998\n", "DES LES RETINA/CHOROID 52461.621223\n", "CORONARY THROMBOLYSIS 87870.550000\n", "dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genderdiffs.sort_values().tail(5)" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Combining DataFrames\n", "\n", "`pd.concat` concatenates rows (i.e., default axis=0) while merging columns with the same name." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "df1 = pd.DataFrame(\n", " {\n", " \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n", " \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n", " \"C\": [\"C0\", \"C1\", \"C2\", \"C3\"],\n", " \"D\": [\"D0\", \"D1\", \"D2\", \"D3\"],\n", " },\n", " index=[0, 1, 2, 3],\n", ")\n", "\n", "df2 = pd.DataFrame(\n", " {\n", " \"B\": [\"B2\", \"B3\", \"B6\", \"B7\"],\n", " \"D\": [\"D2\", \"D3\", \"D6\", \"D7\"],\n", " \"F\": [\"F2\", \"F3\", \"F6\", \"F7\"],\n", " },\n", " index=[2, 3, 6, 7],\n", ")\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
0A0B0C0D0
1A1B1C1D1
2A2B2C2D2
3A3B3C3D3
\n", "
" ], "text/plain": [ " A B C D\n", "0 A0 B0 C0 D0\n", "1 A1 B1 C1 D1\n", "2 A2 B2 C2 D2\n", "3 A3 B3 C3 D3" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BDF
2B2D2F2
3B3D3F3
6B6D6F6
7B7D7F7
\n", "
" ], "text/plain": [ " B D F\n", "2 B2 D2 F2\n", "3 B3 D3 F3\n", "6 B6 D6 F6\n", "7 B7 D7 F7" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCDF
0A0B0C0D0NaN
1A1B1C1D1NaN
2A2B2C2D2NaN
3A3B3C3D3NaN
2NaNB2NaND2F2
3NaNB3NaND3F3
6NaNB6NaND6F6
7NaNB7NaND7F7
\n", "
" ], "text/plain": [ " A B C D F\n", "0 A0 B0 C0 D0 NaN\n", "1 A1 B1 C1 D1 NaN\n", "2 A2 B2 C2 D2 NaN\n", "3 A3 B3 C3 D3 NaN\n", "2 NaN B2 NaN D2 F2\n", "3 NaN B3 NaN D3 F3\n", "6 NaN B6 NaN D6 F6\n", "7 NaN B7 NaN D7 F7" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df1,df2])" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Join\n", "\n", "You can **join** two tables **on** a specific column (or columns). Rows that has the same value (or _key_) in that column will be combined. \n", "\n", " * **inner** join - key must exist in both tables\n", " * **outer** join - key can exist in either table\n", " * **left** join - key must exist in left table\n", " * **right** join - key must exist in right table" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "outputs": [], "source": [ "left = pd.DataFrame(\n", " {\n", " \"key\": [\"K0\", \"K1\", \"K2\", \"K3\"],\n", " \"A\": [\"A0\", \"A1\", \"A2\", \"A3\"],\n", " \"B\": [\"B0\", \"B1\", \"B2\", \"B3\"],\n", " }\n", ")\n", "\n", "\n", "right = pd.DataFrame(\n", " {\n", " \"key\": [ \"K1\", \"K2\", \"K4\"],\n", " \"C\": [\"C1\", \"C2\", \"C4\"],\n", " \"D\": [\"D1\", \"D2\", \"D4\"],\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3B3
\n", "
" ], "text/plain": [ " key A B\n", "0 K0 A0 B0\n", "1 K1 A1 B1\n", "2 K2 A2 B2\n", "3 K3 A3 B3" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyCD
0K1C1D1
1K2C2D2
2K4C4D4
\n", "
" ], "text/plain": [ " key C D\n", "0 K1 C1 D1\n", "1 K2 C2 D2\n", "2 K4 C4 D4" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Inner Join" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyABCD
0K1A1B1C1D1
1K2A2B2C2D2
\n", "
" ], "text/plain": [ " key A B C D\n", "0 K1 A1 B1 C1 D1\n", "1 K2 A2 B2 C2 D2" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(left,right,on='key',how='inner')" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Outer Join" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyABCD
0K0A0B0NaNNaN
1K1A1B1C1D1
2K2A2B2C2D2
3K3A3B3NaNNaN
4K4NaNNaNC4D4
\n", "
" ], "text/plain": [ " key A B C D\n", "0 K0 A0 B0 NaN NaN\n", "1 K1 A1 B1 C1 D1\n", "2 K2 A2 B2 C2 D2\n", "3 K3 A3 B3 NaN NaN\n", "4 K4 NaN NaN C4 D4" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(left,right,on='key',how='outer')" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Left Join" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
keyABCD
0K0A0B0NaNNaN
1K1A1B1C1D1
2K2A2B2C2D2
3K3A3B3NaNNaN
\n", "
" ], "text/plain": [ " key A B C D\n", "0 K0 A0 B0 NaN NaN\n", "1 K1 A1 B1 C1 D1\n", "2 K2 A2 B2 C2 D2\n", "3 K3 A3 B3 NaN NaN" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(left,right,on='key',how='left')" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "# Questions\n", "\n", "Download: https://bits.csb.pitt.edu/files/Hospital_Inpatient_Discharges__SPARCS_De-Identified___2014.csv.gz\n", " * How many data records are there?\n", " * How many coronary bypasses are there?\n", " * What is the average cost? Standard deviation?\n", " * What is the most common diagnosis that leads to a coronary bypass?\n", " * What percent of people with that diagnosis get a coronary bypass?\n", " * What are the facilities whose average cost for this operation is in the top 10%? Bottom 10%?\n", " * How correlated is the length of stay to the cost?\n", " * Is the percentage of people who go to these facilities with the most common diagnosis and receive a coronary bypass significantly different between these two groups?\n", " \n", " * What about knee replacements?\n", " * How well can a decision tree predict the cost of the operation? What are the most important features?\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "skip" }, "tags": [] }, "source": [ "[**Answers**](emr_project.ipynb)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }