
Exploring Python's First-Class Objects Memory Model

Intermediate

This tutorial is from the open-source community.

Introduction

In this lab, you will learn about Python's first-class object concept and explore its memory model. Python treats functions, types, and data as first-class objects, allowing for powerful and flexible programming patterns.

You will also create reusable utility functions for CSV data processing. Specifically, you'll create a generalized function for reading CSV data in the reader.py file, which can be reused across different projects.

This is a Guided Lab, which provides step-by-step instructions to help you learn and practice. Follow the instructions carefully to complete each step and gain hands-on experience. Historical data shows that this is a beginner-level lab with an 85% completion rate. It has received a 100% positive review rate from learners.

Understanding First-Class Objects in Python

In Python, everything is treated as an object. This includes functions and types. What does this mean? Well, it means that you can store functions and types in data structures, pass them as arguments to other functions, and even return them from other functions. This is a very powerful concept, and we're going to explore it using CSV data processing as an example.
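As a quick illustration before we turn to CSV data, here is a minimal sketch (the function names are made up for the example) showing all three ideas at once: functions stored in a list, passed as arguments, and returned from other functions.

```python
def double(x):
    return 2 * x

def square(x):
    return x * x

# Functions stored in a data structure, just like any other objects
ops = [double, square]
print([f(10) for f in ops])  # [20, 100]

# A function that accepts functions as arguments and returns a new function
def compose(f, g):
    def h(x):
        return f(g(x))
    return h

double_of_square = compose(double, square)
print(double_of_square(3))  # double(square(3)) -> 18
```

Nothing special is needed to treat `double` and `square` this way; a function name without parentheses is simply a reference to the function object.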

Exploring First-Class Types

First, let's start the Python interpreter. Open a new terminal in the WebIDE and type the following command. This command will start the Python interpreter, which is where we'll be running our Python code.

python3

When working with CSV files in Python, we often need to convert the strings we read from these files into appropriate data types. For example, a number in a CSV file might be read as a string, but we want to use it as an integer or a float in our Python code. To do this, we can create a list of conversion functions.

coltypes = [str, int, float]

Notice that we're creating a list that contains the actual type functions, not strings. In Python, types are first-class objects, which means we can treat them just like any other object. We can put them in lists, pass them around, and use them in our code.

Now, let's read some data from a portfolio CSV file to see how we can use these conversion functions.

import csv
f = open('portfolio.csv')
rows = csv.reader(f)
headers = next(rows)
row = next(rows)
print(row)

When you run this code, you should see output similar to the following. This is the first row of data from the CSV file, represented as a list of strings.

['AA', '100', '32.20']

Next, we'll use the zip function. The zip function takes multiple iterables (like lists or tuples) and pairs up their elements. We'll use it to pair each value from the row with its corresponding type conversion function.

r = list(zip(coltypes, row))
print(r)

This will produce the following output. Each pair contains a type function and a string value from the CSV file.

[(<class 'str'>, 'AA'), (<class 'int'>, '100'), (<class 'float'>, '32.20')]

Now that we have these pairs, we can apply each function to convert the values to their appropriate types.

record = [func(val) for func, val in zip(coltypes, row)]
print(record)

The output will show that the values have been converted to their appropriate types. The string 'AA' remains a string, '100' becomes the integer 100, and '32.20' becomes the float 32.2.

['AA', 100, 32.2]

We can also combine these values with their column names to create a dictionary. A dictionary is a useful data structure in Python that allows us to store key-value pairs.

record_dict = dict(zip(headers, record))
print(record_dict)

The output will be a dictionary where the keys are the column names and the values are the converted data.

{'name': 'AA', 'shares': 100, 'price': 32.2}

You can perform all these steps in a single comprehension. A comprehension is a concise way to create lists, dictionaries, or sets in Python.

result = {name: func(val) for name, func, val in zip(headers, coltypes, row)}
print(result)

The output will be the same dictionary as before.

{'name': 'AA', 'shares': 100, 'price': 32.2}

When you're done working in the Python interpreter, you can exit it by typing the following command.

exit()

This demonstration shows how Python's treatment of functions as first-class objects enables powerful data processing techniques. By being able to treat types and functions as objects, we can write more flexible and concise code.

Creating a Utility Function for CSV Processing

Now that we understand how Python's first-class objects can help us with data conversion, we're going to create a reusable utility function. This function will read CSV data and transform it into a list of dictionaries. This is a very useful operation because CSV files are commonly used to store tabular data, and converting them into a list of dictionaries makes it easier to work with the data in Python.

Creating the CSV Reader Utility

First, open the WebIDE. Once it's open, navigate to the project directory and create a new file named reader.py. In this file, we'll define a function that reads CSV data and applies type conversions. Type conversions are important because the data in a CSV file is usually read as strings, but we might need different data types like integers or floating-point numbers for further processing.

Add the following code to reader.py:

import csv

def read_csv_as_dicts(filename, types):
    """
    Read a CSV file into a list of dictionaries, converting each field according
    to the types provided.

    Parameters:
    filename (str): Name of the CSV file to read
    types (list): List of type conversion functions for each column

    Returns:
    list: List of dictionaries representing the CSV data
    """
    records = []
    with open(filename, 'r') as f:
        rows = csv.reader(f)
        headers = next(rows)  ## Get the column headers

        for row in rows:
            ## Apply type conversions to each value in the row
            converted_row = [func(val) for func, val in zip(types, row)]

            ## Create a dictionary mapping headers to converted values
            record = dict(zip(headers, converted_row))
            records.append(record)

    return records

This function first opens the specified CSV file. It then reads the headers of the CSV file, which are the names of the columns. After that, it goes through each row in the file. For each value in the row, it applies the corresponding type conversion function from the types list. Finally, it creates a dictionary where the keys are the column headers and the values are the converted data, and adds this dictionary to the records list. Once all rows are processed, it returns the records list.
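To see the same conversion logic in isolation, here is a minimal sketch that runs the identical steps against an in-memory CSV (`io.StringIO` stands in for a real file, and the two data rows are made up for illustration):

```python
import csv
import io

# Hypothetical in-memory data standing in for a CSV file on disk
data = io.StringIO("name,shares,price\nAA,100,32.20\nIBM,50,91.10\n")

rows = csv.reader(data)
headers = next(rows)          # ['name', 'shares', 'price']
types = [str, int, float]

records = []
for row in rows:
    # Apply each conversion function to its corresponding string value
    converted = [func(val) for func, val in zip(types, row)]
    records.append(dict(zip(headers, converted)))

print(records[0])  # {'name': 'AA', 'shares': 100, 'price': 32.2}
```

This is exactly what read_csv_as_dicts does, minus the file handling.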

Testing the Utility Function

Let's test our utility function. First, open a terminal and start a Python interpreter by typing:

python3

Now that we're in the Python interpreter, we can use our function to read the portfolio data. The portfolio data is a CSV file that contains information about stocks, such as the name of the stock, the number of shares, and the price.

import reader
portfolio = reader.read_csv_as_dicts('portfolio.csv', [str, int, float])
for record in portfolio[:3]:  ## Show the first 3 records
    print(record)

When you run this code, you should see output similar to:

{'name': 'AA', 'shares': 100, 'price': 32.2}
{'name': 'IBM', 'shares': 50, 'price': 91.1}
{'name': 'CAT', 'shares': 150, 'price': 83.44}

This output shows the first three records from the portfolio data, with the data types correctly converted.

Let's also try our function with the CTA bus data. The CTA bus data is another CSV file that contains information about bus routes, dates, day types, and the number of rides.

rows = reader.read_csv_as_dicts('ctabus.csv', [str, str, str, int])
print(f"Total rows: {len(rows)}")
print("First row:", rows[0])

The output should be something like:

Total rows: 577563
First row: {'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}

This shows that our function can handle different CSV files and apply the appropriate type conversions.

To exit the Python interpreter, type:

exit()

You've now created a reusable utility function that can read any CSV file and apply appropriate type conversions. This demonstrates the power of Python's first-class objects and how they can be used to create flexible, reusable code.

Exploring Python's Memory Model

Python's memory model plays a crucial role in determining how objects are stored in memory and how they are referenced. Understanding this model is essential, especially when dealing with large datasets, as it can significantly impact the performance and memory usage of your Python programs. In this step, we'll specifically focus on how string objects are handled in Python and explore ways to optimize memory usage for large datasets.

String Repetition in Datasets

The CTA bus data contains many repeated values, such as route names. Repeated values in a dataset can lead to inefficient memory usage if not handled properly. To understand the extent of this issue, let's first examine how many unique route strings are in the dataset.

First, open a Python interpreter. You can do this by running the following command in your terminal:

python3

Once the Python interpreter is open, we'll load the CTA bus data and find the unique routes. Here's the code to achieve this:

import reader
rows = reader.read_csv_as_dicts('ctabus.csv', [str, str, str, int])

## Find unique route names
routes = {row['route'] for row in rows}
print(f"Number of unique route names: {len(routes)}")

In this code, we first import the reader module, which contains the read_csv_as_dicts function we created earlier. We then use read_csv_as_dicts to load the data from the ctabus.csv file. The second argument, [str, str, str, int], specifies the type conversion function for each column. Finally, we use a set comprehension to collect all the unique route names in the dataset and print how many there are.

The output should be:

Number of unique route names: 181

Now, let's check how many different string objects are created for these routes. Even though there are only 181 unique route names, Python might create a new string object for each occurrence of a route name in the dataset. To verify this, we'll use the id() function to get the unique identifier of each string object.

## Count unique string object IDs
routeids = {id(row['route']) for row in rows}
print(f"Number of unique route string objects: {len(routeids)}")

The output might surprise you:

Number of unique route string objects: 542305

This shows that there are only 181 unique route names, but over 500,000 unique string objects. This happens because Python creates a new string object for each row, even if the values are the same. This can lead to significant memory waste, especially when dealing with large datasets.
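You can reproduce the effect in miniature (the strings here are hypothetical, and the identity behavior described is that of CPython): two equal strings built at runtime are distinct objects with distinct ids, just like each row's route value.

```python
# Two equal strings constructed at runtime are separate objects in CPython
s1 = ''.join(['ro', 'ute 3'])
s2 = ''.join(['route', ' 3'])

print(s1 == s2)   # True  - same text
print(s1 is s2)   # False - two distinct string objects
print(id(s1) != id(s2))  # True
```

Each call to csv.reader yields freshly built strings like these, which is why the dataset ends up with hundreds of thousands of duplicates.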

String Interning to Save Memory

Python provides a way to "intern" (reuse) strings using the sys.intern() function. String interning can save memory when you have many duplicate strings in your dataset. When you intern a string, Python checks if an identical string already exists in the intern pool. If it does, it returns a reference to the existing string object instead of creating a new one.

Let's demonstrate how string interning works with a simple example:

import sys

## Without interning
a = 'hello world'
b = 'hello world'
print(f"a is b (without interning): {a is b}")

## With interning
a = sys.intern(a)
b = sys.intern(b)
print(f"a is b (with interning): {a is b}")

In this code, we first create two string variables a and b with the same value without interning. The is operator checks if two variables refer to the same object. Without interning, a and b are different objects, so a is b returns False. Then, we intern both strings using sys.intern(). After interning, a and b refer to the same object in the intern pool, so a is b returns True.

The output should be:

a is b (without interning): False
a is b (with interning): True

Now, let's use string interning when reading the CTA bus data to reduce memory usage. We'll also use the tracemalloc module to track the memory usage before and after interning.

import sys
import reader
import tracemalloc

## Start memory tracking
tracemalloc.start()

## Read data with interning for the route column
rows = reader.read_csv_as_dicts('ctabus.csv', [sys.intern, str, str, int])

## Check unique route objects again
routeids = {id(row['route']) for row in rows}
print(f"Number of unique route string objects (with interning): {len(routeids)}")

## Check memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 1024 / 1024:.2f} MB")
print(f"Peak memory usage: {peak / 1024 / 1024:.2f} MB")

In this code, we first start memory tracking using tracemalloc.start(). Then, we read the CTA bus data with interning for the route column by passing sys.intern as the data type for the first column. After that, we check the number of unique route string objects again and print the current and peak memory usage.

The output should be something like:

Number of unique route string objects (with interning): 181
Current memory usage: 189.56 MB
Peak memory usage: 209.32 MB

Let's restart the interpreter and try interning both route and date strings to see if we can reduce the memory usage further.

exit()

Start Python again:

python3
import sys
import reader
import tracemalloc

## Start memory tracking
tracemalloc.start()

## Read data with interning for both route and date columns
rows = reader.read_csv_as_dicts('ctabus.csv', [sys.intern, sys.intern, str, int])

## Check memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage (interning route and date): {current / 1024 / 1024:.2f} MB")
print(f"Peak memory usage (interning route and date): {peak / 1024 / 1024:.2f} MB")

The output should show a further decrease in memory usage:

Current memory usage (interning route and date): 170.23 MB
Peak memory usage (interning route and date): 190.05 MB

This demonstrates how understanding Python's memory model and using techniques like string interning can help optimize your programs, especially when dealing with large datasets containing repeated values.

Finally, exit the Python interpreter:

exit()

Column-Oriented Data Storage

So far, we've been storing CSV data as a list of row dictionaries: each row in the CSV file is represented as a dictionary whose keys are the column headers and whose values are the corresponding data in that row. When dealing with large datasets, however, this method can be inefficient.

Storing data in a column-oriented format can be a better choice. In a column-oriented approach, each column's data is stored in a separate list. This can significantly reduce memory usage because similar data types are grouped together, and it can also improve performance for certain operations, such as aggregating data by column.
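A rough sketch of why this helps (the values are made up, and exact byte counts vary across Python versions): every row dictionary carries its own hash-table overhead, while a column pays its list overhead only once.

```python
import sys

# Row-oriented: one dict per row, so the dict overhead is paid n times
row = {'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
print(sys.getsizeof(row))           # per-row cost, repeated for every row

# Column-oriented: one list per column, overhead paid once per column
rides_column = [7354, 9288, 6048]
print(sys.getsizeof(rides_column))  # grows with the data, not with dict overhead

# For n rows, the row layout pays n dict overheads; the column layout
# pays a handful of list overheads regardless of n.
```

These numbers only capture container overhead, not the stored objects themselves, but the per-row dict cost is the dominant difference at half a million rows.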

Creating a Column-Oriented Data Reader

Now, we're going to create a new file that will help us read CSV data in a column-oriented format. Create a new file named colreader.py in the project directory with the following code:

import csv

class DataCollection:
    def __init__(self, headers, columns):
        """
        Initialize a column-oriented data collection.

        Parameters:
        headers (list): Column header names
        columns (dict): Dictionary mapping header names to column data lists
        """
        self.headers = headers
        self.columns = columns
        self._length = len(columns[headers[0]]) if headers else 0

    def __len__(self):
        """Return the number of rows in the collection."""
        return self._length

    def __getitem__(self, index):
        """
        Get a row by index, presented as a dictionary.

        Parameters:
        index (int): Row index

        Returns:
        dict: Dictionary representing the row at the given index
        """
        if isinstance(index, int):
            if index < 0 or index >= self._length:
                raise IndexError("Index out of range")

            return {header: self.columns[header][index] for header in self.headers}
        else:
            raise TypeError("Index must be an integer")

def read_csv_as_columns(filename, types):
    """
    Read a CSV file into a column-oriented data structure, converting each field
    according to the types provided.

    Parameters:
    filename (str): Name of the CSV file to read
    types (list): List of type conversion functions for each column

    Returns:
    DataCollection: Column-oriented data collection representing the CSV data
    """
    with open(filename, 'r') as f:
        rows = csv.reader(f)
        headers = next(rows)  ## Get the column headers

        ## Initialize columns
        columns = {header: [] for header in headers}

        ## Read data into columns
        for row in rows:
            ## Convert values according to the specified types
            converted_values = [func(val) for func, val in zip(types, row)]

            ## Add each value to its corresponding column
            for header, value in zip(headers, converted_values):
                columns[header].append(value)

    return DataCollection(headers, columns)

This code does two important things:

  1. It defines a DataCollection class. This class stores data in columns, but it allows us to access the data as if it were a list of row dictionaries. This is useful because it provides a familiar way to work with the data.
  2. It defines a read_csv_as_columns function. This function reads CSV data from a file and stores it in a column-oriented structure. It also converts each field in the CSV file according to the types we provide.
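One detail worth noting: because DataCollection implements __len__ and __getitem__ (raising IndexError past the end), Python's fallback sequence iteration protocol makes it work in plain for loops and with list(), even though it defines no __iter__. A minimal sketch of the same pattern with a hypothetical class:

```python
class Letters:
    """A tiny sequence-like class: indexable by integer, hence also iterable."""
    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("Index must be an integer")
        if index < 0 or index >= len(self._items):
            raise IndexError("Index out of range")
        return self._items[index]

letters = Letters('abc')
print(len(letters))    # 3
print(list(letters))   # ['a', 'b', 'c'] - iteration calls __getitem__ with 0, 1, 2, ...
```

Iteration stops when __getitem__ raises IndexError, which is why the bounds check in DataCollection matters.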

Testing the Column-Oriented Reader

Let's test our column-oriented reader using the CTA bus data. First, open a Python interpreter. You can do this by running the following command in your terminal:

python3

Once the Python interpreter is open, run the following code:

import colreader
import tracemalloc
from sys import intern

## Start memory tracking
tracemalloc.start()

## Read data into column-oriented structure with string interning
data = colreader.read_csv_as_columns('ctabus.csv', [intern, intern, intern, int])

## Check that we can access the data like a list of dictionaries
print(f"Number of rows: {len(data)}")
print("First 3 rows:")
for i in range(3):
    print(data[i])

## Check memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 1024 / 1024:.2f} MB")
print(f"Peak memory usage: {peak / 1024 / 1024:.2f} MB")

The output should look like this:

Number of rows: 577563
First 3 rows:
{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}
{'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': 9288}
{'route': '6', 'date': '01/01/2001', 'daytype': 'U', 'rides': 6048}
Current memory usage: 38.67 MB
Peak memory usage: 103.42 MB

Now, let's compare this with our previous row-oriented approach. Run the following code in the same Python interpreter:

import reader
import tracemalloc
from sys import intern

## Reset memory tracking
tracemalloc.reset_peak()

## Read data into row-oriented structure with string interning
rows = reader.read_csv_as_dicts('ctabus.csv', [intern, intern, intern, int])

## Check memory usage
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage (row-oriented): {current / 1024 / 1024:.2f} MB")
print(f"Peak memory usage (row-oriented): {peak / 1024 / 1024:.2f} MB")

The output should be something like this:

Current memory usage (row-oriented): 170.23 MB
Peak memory usage (row-oriented): 190.05 MB

As you can see, the column-oriented approach uses significantly less memory!

Let's also test that we can still analyze the data as before. Run the following code:

## Find all unique routes in the column-oriented data
routes = {row['route'] for row in data}
print(f"Number of unique routes: {len(routes)}")

## Count rides per route (first 5)
from collections import defaultdict
route_rides = defaultdict(int)
for row in data:
    route_rides[row['route']] += row['rides']

## Show the top 5 routes by total rides
top_routes = sorted(route_rides.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 routes by total rides:")
for route, rides in top_routes:
    print(f"Route {route}: {rides:,} rides")

The output should be:

Number of unique routes: 181
Top 5 routes by total rides:
Route 9: 158,545,826 rides
Route 49: 129,872,910 rides
Route 77: 120,086,065 rides
Route 79: 109,348,708 rides
Route 4: 91,405,538 rides

Finally, exit the Python interpreter by running the following command:

exit()

We can see that the column-oriented approach not only saves memory but also allows us to perform the same analyses as before. This shows how different data storage strategies can have a significant impact on performance while still providing the same interface for us to work with the data.

Summary

In this lab, you have learned several key Python concepts. First, you understood how Python treats functions, types, and other entities as first-class objects, allowing them to be passed around and stored like regular data. Second, you created reusable utility functions for CSV data processing with automatic type conversion.

Moreover, you explored Python's memory model and used string interning to reduce memory usage for repetitive data. You also implemented a more efficient column-oriented storage method for large datasets, providing a familiar user interface. These concepts showcase Python's flexibility and power in data processing, and the techniques can be applied to real-world data analysis projects.
