How To Manage Files in Google Colab

Siddhant Sadangi
May 4, 2021

[Cover photo by Andrew Pons on Unsplash]


Google Colaboratory is a free Jupyter notebook environment that runs on Google’s cloud servers, letting you leverage backend hardware like GPUs and TPUs. It lets you do everything you can do in a Jupyter notebook hosted on your local machine, without any of the installation and setup needed to host one yourself.

Colab comes with (almost) all the setup you need to start coding, but what it doesn’t have out of the box is your data sets. How do you access your data from within Colab?

In this article we will talk about:

  • How to load data to Colab from a multitude of data sources
  • How to write back to those data sources from within Colab
  • Limitations of Google Colab while working with external files

Directory and File Operations in Google Colab

Since Colab lets you do everything you can in a locally hosted Jupyter notebook, you can also use shell commands like ls, dir, pwd, cd, cat, echo, and so on, either through line magics (%) or by prefixing them with an exclamation mark (!).
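For example, here are a few commands you might run in a Colab cell (a minimal sketch; the path assumes Colab’s default /content working directory):

# List the contents of the current directory with a shell command
! ls -l

# Print the present working directory with a line magic
%pwd

# Change directory with the %cd line magic (unlike !cd, it persists across cells)
%cd /content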

To browse the directory structure, you can use the file explorer pane on the left.

[Image: Browsing directories in Colab]

How To Upload Files to and Download Files From Google Colab

Since a Colab notebook is hosted on Google’s cloud servers, there’s no direct access to files on your local drive (unlike a notebook hosted on your machine) or any other environment by default.

However, Colab provides various options to connect to almost any data source you can imagine. Let us see how.

Accessing GitHub From Google Colab

You can either clone an entire GitHub repository to your Colab environment or access individual files from their raw link.

Clone a GitHub repository

You can clone a GitHub repository into your Colab environment the same way you would on your local machine, using git clone. Once the repository is cloned, refresh the file explorer to browse through its contents.

Then you can simply read its files as you would on your local machine.
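For example, a minimal sketch (the repository URL and the CSV path are placeholders you would replace with your own):

# Clone a public repository into the current Colab directory
! git clone https://github.com/<username>/<repository>.git

# Read a file from the cloned repository, just like a local file
import pandas as pd
df = pd.read_csv('<repository>/data/sample.csv')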

[Image: Cloning a repository into Colab]

Load individual files directly from GitHub

If you only need to work with a few files rather than the entire repository, you can load them directly from GitHub without cloning the repository to Colab.

To do this:

  1. Click on the file in the repository.
  2. Click on View Raw.
  3. Copy the URL of the raw file.
  4. Use this URL as the location of your file.

[Image: Reading files loaded to Colab]
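For example, a minimal sketch using pandas (the raw URL below is a placeholder for the one you copied in step 3):

import pandas as pd

# Use the raw.githubusercontent.com URL as if it were a local file path
url = 'https://raw.githubusercontent.com/<username>/<repository>/main/<file>.csv'
df = pd.read_csv(url)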

Accessing Local File System With Google Colab

You can read from or write to your local file system either using the file explorer or Python code.

Access local files through the file explorer

Uploading files from local file system through file explorer

You can use the upload option at the top of the file explorer pane to upload any file(s) from your local file system to Colab’s present working directory.

To upload files directly to a sub-directory you need to:

  1. Click on the three dots visible when you hover above the directory.
  2. Select the Upload option.

[Image: Uploading files to Colab from the local file system]

3. Select the file(s) you wish to upload from the File Upload dialog window.

4. Wait for the upload to complete. The upload progress is shown at the bottom of the file explorer pane.

[Image: Upload progress in Colab]

Once the upload is complete, you can read from the file as you would normally.

[Image: Reading files loaded to Colab]

Downloading files to local file system through file explorer

Click on the three dots which are visible while hovering above the filename, and select the Download option.

[Image: Downloading files from Colab to the local file system]

Accessing local file system using Python code

This step requires you to first import the files module from the google.colab library:

from google.colab import files

Uploading files from local file system using Python code

You use the upload method of the files object:

uploaded = files.upload()

Running this opens the File Upload dialog window:

[Image: Uploading files to Colab programmatically]

Select the file(s) you wish to upload and then wait for the upload to complete. The upload progress is displayed:

[Image: Programmatic file upload progress]

The uploaded object is a dictionary with the filenames as keys and the file contents (as bytes) as values:

[Image: Uploaded object structure]
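For example, you can quickly check what was uploaded (a minimal sketch that simply prints each filename and its size):

# Keys are filenames, values are the raw file contents as bytes
for filename, content in uploaded.items():
    print(f'{filename}: {len(content)} bytes')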

Once the upload is complete, you can read it as you would any other file from Colab:

df4 = pd.read_json("News_Category_Dataset_v2.json", lines=True)

Or you can read it directly from the uploaded dict using the io library:

import io
df5 = pd.read_json(io.BytesIO(uploaded['News_Category_Dataset_v2.json']), lines=True)

Make sure that the filename matches the name of the file you wish to load.

Downloading files from Colab to local file system using Python code

The download method of the files object can be used to download any file from Colab to your local drive. The download progress is displayed, and once it completes, you can choose where to save the file on your local machine.
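A minimal sketch (the filename is a placeholder for any file present in your Colab environment):

# Trigger a browser download of a file from the Colab filesystem
files.download('sample_output.csv')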

[Image: Downloading files programmatically from Colab]

Accessing Google Drive From Google Colab

You can use the drive module from google.colab to mount your entire Google Drive to Colab through the following steps:

1. Execute the below code, which will provide you with an authentication link.

from google.colab import drive
drive.mount('/content/gdrive')

2. Open the link.

3. Choose the Google account whose drive you want to mount.

4. Allow Google Drive Stream access to your Google account.

5. Copy the code displayed, paste it in the text box as shown below, and press Enter.

[Image: Mounting Google Drive to Colab]

Once the drive is mounted, you’ll get the message “Mounted at /content/gdrive,” and you’ll be able to browse through the contents of your drive from the file explorer pane.

[Image: Exploring Google Drive in Colab]

Now you can interact with your Google Drive as if it were a folder in your Colab environment. Any changes to this folder are reflected directly in your Google Drive, and you can read its files just like any other files in Colab.
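For example, a minimal sketch (the path under 'My Drive' is a placeholder for a file in your own Drive):

import pandas as pd

# Read a CSV stored in Google Drive through the mounted path
df = pd.read_csv('/content/gdrive/My Drive/data/sample.csv')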

You can even write directly to Google Drive from Colab using the usual file/directory operations.

! touch "/content/gdrive/My Drive/sample_file.txt"

This will create a file in your Google Drive, which will be visible in the file explorer pane once you refresh it:

[Images: The new file visible in Colab’s file explorer and in Google Drive]

Accessing Google Sheets From Google Colab

To access Google Sheets:

  1. You need to first authenticate the Google account to be linked with Colab by running the code below:
from google.colab import auth
auth.authenticate_user()

2. Executing the above code will provide you with an authentication link. Open the link.

3. Choose the Google account which you want to link.

4. Allow Google Cloud SDK to access your Google account.

5. Finally, copy the code displayed, paste it in the text box shown, and hit Enter.

[Image: Authenticating Google Cloud SDK]

To interact with Google Sheets, you need to import the preinstalled gspread library. And to authorize gspread to access your Google account, you need the GoogleCredentials class from the preinstalled oauth2client.client library:

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

Once the above code is run, an application default credentials (ADC) JSON file will be created in the present working directory. This contains the credentials used by gspread to access your Google account.

[Image: The adc.json file created for Google Sheets access]

Once this is done, you can now create or load Google sheets directly from your Colab environment.

Creating/updating a Google sheet in Colab

  1. Use the gc object’s create method to create a workbook:
wb = gc.create('demo')

2. Once the workbook is created, you can view it at sheets.google.com.

[Image: The new workbook visible at sheets.google.com]

3. To write values to the workbook, first open a worksheet:

ws = gc.open('demo').sheet1

4. Then select the cell(s) you want to write to:

[Image: Updating sheets in Google Sheets from Colab (1)]

5. This creates a list of cells with their index (R1C1) and value (currently blank). You can modify the individual cells by updating their value attribute:

[Image: Updating sheets in Google Sheets from Colab (2)]

6. To update these cells in the worksheet, use the update_cells method:

[Image: Updating sheets in Google Sheets from Colab (3)]

7. The changes will now be reflected in your Google sheet.

[Image: Viewing the updates in the Google sheet]
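Putting steps 3 to 6 together, here is a minimal sketch (the A1:C2 range and the values written are arbitrary placeholders):

# Open the first worksheet of the 'demo' workbook
ws = gc.open('demo').sheet1

# Select a block of cells; each has an R1C1-style index and a (blank) value
cells = ws.range('A1:C2')

# Modify the value attribute of each cell
for i, cell in enumerate(cells):
    cell.value = i + 1

# Push all modified cells back to the worksheet in a single call
ws.update_cells(cells)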

Downloading data from a Google sheet

1. Use the gc object’s open method to open a workbook:

wb = gc.open('demo')

2. Then read all the rows of a specific worksheet by using the get_all_values method:

[Image: Downloading sheet data from Google Sheets to Colab]

3. To load these into a data frame, you can use the DataFrame object’s from_records method:

[Image: Creating a data frame from the downloaded Google sheet]
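Putting these steps together, a minimal sketch (it assumes the first row of the sheet holds column headers):

# Read every row of the first worksheet as a list of lists
rows = gc.open('demo').sheet1.get_all_values()

# Build a data frame, treating the first row as the header
import pandas as pd
df = pd.DataFrame.from_records(rows[1:], columns=rows[0])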

Accessing Google Cloud Storage (GCS) From Google Colab

You need a Google Cloud Platform (GCP) project to use GCS. You can create and access your GCS buckets in Colab via the preinstalled gsutil command-line utility.

1. First specify your project ID:

project_id = '<project_ID>'

2. To access GCS, you have to authenticate your Google account:

from google.colab import auth
auth.authenticate_user()

3. Executing the above code will provide you with an authentication link. Open the link.

4. Choose the Google account which you want to link.

5. Allow Google Cloud SDK to access your Google account.

6. Finally, copy the code displayed, paste it in the text box shown, and hit Enter.

[Image: Authenticating Google Cloud SDK]

7. Then you configure gsutil to use your project:

! gcloud config set project {project_id}

8. You can make a bucket using the “make bucket” (mb) command. GCS bucket names must be globally unique, so use the preinstalled uuid library to generate a universally unique ID (UUID) for the name:

import uuid
bucket_name = f'sample-bucket-{uuid.uuid1()}'
! gsutil mb gs://{bucket_name}

9. Once the bucket is created, you can upload a file from your Colab environment to it:

! gsutil cp /tmp/to_upload.txt gs://{bucket_name}/

10. Once the upload has finished, the file will be visible in the GCS browser for your project: https://console.cloud.google.com/storage/browser?project=<project_id>

11. To download a file from your bucket back to the Colab environment, reverse the source and destination:

! gsutil cp gs://{bucket_name}/{filename} {download_location}

Once the download has finished, the file will be visible in the Colab file explorer pane in the download location specified.

Accessing AWS S3 From Google Colab

You need to have an AWS account, configure IAM, and generate your access key and secret access key to be able to access S3 from Colab. You also need to install the awscli library to your Colab environment:

1. Install the awscli library:

! pip install awscli

2. Once installed, configure AWS by running aws configure:

[Image: AWS configuration]

3. Enter your access_key and secret_access_key in the text boxes, and press Enter.

Then you can download any file from S3:

! aws s3 cp s3://{bucket_name} ./{download_location} --recursive --exclude "*" --include {filepath_on_s3}

filepath_on_s3 can point to a single file or match multiple files using a pattern.

You will be notified once the download is complete, and the downloaded file(s) will be available in the location you specified to be used as you wish.

To upload a file, just reverse the source and destination arguments:

! aws s3 cp ./{upload_from} s3://{bucket_name} --recursive --exclude "*" --include {file_to_upload}

file_to_upload can point to a single file or match multiple files using a pattern.

You will be notified once the upload is complete, and the uploaded file(s) will be available in your S3 bucket in the folder specified: https://s3.console.aws.amazon.com/s3/buckets/{bucket_name}/{folder}/?region={region}

Accessing Kaggle Data Sets From Google Colab

To download data sets from Kaggle, you first need a Kaggle account and an API token.

  1. To generate your API token, go to My Account, then Create New API Token.
  2. Open the kaggle.json file and copy its contents. It should be in the form of {"username":"########", "key":"################################"}.
  3. Then run the below commands in Colab:
! mkdir ~/.kaggle #create the .kaggle folder in your root directory
! echo '<PASTE_CONTENTS_OF_KAGGLE_API_JSON>' > ~/.kaggle/kaggle.json #write kaggle API credentials to kaggle.json
! chmod 600 ~/.kaggle/kaggle.json # set permissions
! pip install kaggle #install the kaggle library

4. Once the kaggle.json file has been created in Colab and the Kaggle library has been installed, you can search for a data set using the following:

! kaggle datasets list -s {KEYWORD}

5. Then download the data set using:

! kaggle datasets download -d {DATASET NAME} -p /content/kaggle/

The data set will be downloaded and will be available in the path specified (/content/kaggle/ in this case).
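Kaggle usually delivers data sets as .zip archives, so you will typically need to extract them before use. A minimal sketch (the archive name is a placeholder for the data set you downloaded):

import zipfile

# Extract the downloaded archive into the same folder
with zipfile.ZipFile('/content/kaggle/<dataset-name>.zip') as archive:
    archive.extractall('/content/kaggle/')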

Accessing MySQL Databases From Google Colab

1. You need to import the preinstalled sqlalchemy library to work with relational databases.

import sqlalchemy

2. Enter the connection details and create the engine:

HOSTNAME = 'ENTER_HOSTNAME'
USER = 'ENTER_USERNAME'
PASSWORD = 'ENTER_PASSWORD'
DATABASE = 'ENTER_DATABASE_NAME'

connection_string = f'mysql+pymysql://{USER}:{PASSWORD}@{HOSTNAME}/{DATABASE}'

engine = sqlalchemy.create_engine(connection_string)

3. Finally, create the SQL query and load the results into a data frame using pd.read_sql_query():

import pandas as pd

TABLE = 'ENTER_TABLE_NAME'
query = f"SELECT * FROM {DATABASE}.{TABLE}"

df = pd.read_sql_query(query, engine)
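If you also want to write results back to the database, here is a minimal sketch using pandas’ to_sql (the target table name and the if_exists policy are assumptions you should adapt):

# Append the data frame to a table, creating it if it does not exist
df.to_sql('ENTER_TABLE_NAME', engine, if_exists='append', index=False)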

Limitations of Google Colab When Working With Files

One important caveat to remember while using Colab is that the files you upload to it won’t be available forever. Colab is a temporary environment with an idle timeout of 90 minutes and an absolute timeout of 12 hours. This means that the runtime will disconnect if it has remained idle for 90 minutes or if it has been in use for 12 hours. On disconnection, you lose all your variables, states, installed packages, and files and will be connected to an entirely new and clean environment on reconnecting.

Also, Colab has a disk space limitation of 108 GB, of which only 77 GB is available to the user. While this should be enough for most tasks, keep this in mind while working with larger data sets, like images or video data.

Conclusion

Google Colab is a great tool for individuals who want to harness the power of high-end computing resources, like GPUs, without being restricted by their price.

In this article, we have gone through most of the ways you can supercharge your Google Colab experience: reading external files or data into Google Colab and writing from Google Colab back to those external data sources.

Depending on your use case or how your data architecture is set up, you can easily apply the above-mentioned methods to connect your data source directly to Colab and start coding.

