A step-by-step approach to performing data analysis using Python

So you have decided to learn Python, but you don’t have prior programming experience, and you are unsure where to start and how much Python to learn. These are some of the common questions a beginner has when getting started with Python (for data-centric applications).

“How long does it take to learn Python?”

“How much Python should I learn for performing data analysis?”

“What are the best books/courses to learn Python?”

“Should I be an expert Python programmer in order to work with data sets?”

It is good to be confused when you begin learning a new skill; that’s what the author of “learn anything in 20 hours” says. Don’t panic: I will show you how to get started quickly without becoming a Python coding ninja.

Don’t make the mistake I did

Before getting started with Python, I had the misconception that I had to be proficient in Python programming to perform data analysis with it. So I took Udacity’s intro Python programming course, completed the Codecademy Python tutorials, and read several Python programming books. For 3 months (spending 3 hours per day), I learned Python programming by completing small software projects. Coding was fun, but my goal was to do data analysis using Python, not to become a Python developer. Then I realized that I was spending more time learning how to develop software in Python than doing data analysis.

After a few hours of research, I found out that I needed to learn 5 Python libraries to effectively solve a broad set of data analysis problems. I then started learning these libraries one after the other.

In my opinion, it is not necessary to become proficient at building good software in Python to be able to productively perform data analysis.

Ignore resources intended for a general audience

While there are many excellent Python books and online courses, I wouldn’t recommend some of them, as they are intended for a general audience rather than for someone who wants to do data analysis. There are also a couple of books on “Scientific Programming in Python”, but they are geared toward mathematically oriented topics rather than data analysis and statistics. Don’t waste your time taking courses and reading books that are intended for a general audience.

Before proceeding further, set up your programming environment and learn how to work in the IPython notebook.

Learning Pathway

Start with Codecademy and complete all of its Python exercises. You can finish them in 20 days by investing 3 hours per day. Codecademy covers all the basic Python concepts. It doesn’t follow a project-oriented approach like Udacity, but that doesn’t matter, because your goal is to work on data science projects, not to build software using Python.

After completing the Codecademy exercises, go through this IPython notebook:

Python Essentials Tutorials (I have provided the links to download the file in the conclusion)

It covers concepts that are not covered in Codecademy. You can complete this tutorial within an hour or two.

Now you know enough basics to start learning Python libraries.

NumPy

First, start learning NumPy, as it is the fundamental package for scientific computing with Python. A good understanding of NumPy will help you use tools like Pandas effectively.

I have prepared an IPython notebook that covers the basic concepts of NumPy. The tutorial covers the most frequently performed operations in NumPy, such as working with N-dimensional arrays, indexing and slicing arrays, indexing with integer arrays, transposing an array, universal functions, data processing using arrays, and frequently used statistical methods.

Numpy Basics Tutorial
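To give a feel for the operations listed above, here is a minimal sketch. It is not taken from the tutorial notebook; the array contents and variable names are made up purely for illustration.

```python
import numpy as np

# A 2-D array (3 rows, 4 columns) holding the integers 0..11.
a = np.arange(12).reshape(3, 4)

# Indexing and slicing: one element, then rows 0-1 of columns 1-2.
element = a[1, 2]
block = a[:2, 1:3]

# Indexing with an integer array: pick rows 2 and 0, in that order.
picked_rows = a[[2, 0]]

# Transposing swaps the two axes, so the shape becomes (4, 3).
transposed = a.T

# Universal functions are applied element by element.
roots = np.sqrt(a)

# Data processing with arrays: replace odd values with 0 without a loop.
evens_only = np.where(a % 2 == 0, a, 0)

# Frequently used statistical methods.
total = a.sum()
column_means = a.mean(axis=0)

print(block)
print(column_means)
```

Running a cell like this in the IPython notebook and changing one line at a time is a quick way to see what each operation does.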

Pandas

Pandas provides high-level data structures and manipulation tools that make data analysis fast and easy in Python.

The tutorial covers working with Series and DataFrames, dropping entries from an axis, working with missing values, etc.

Pandas Basics Tutorial
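As a rough illustration of those topics (again, not the notebook's own code), the sketch below builds a Series and a small DataFrame, drops entries from each axis, and handles missing values. The city data is invented for the example.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labelled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a table of labelled columns (made-up values).
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune"],
        "temp_c": [4.0, np.nan, 31.0],
        "rain_mm": [12.0, 0.0, np.nan],
    }
)

# Dropping entries from an axis: a row by label, then a column by name.
without_first_row = df.drop(index=0)
without_rain = df.drop(columns="rain_mm")

# Working with missing values: count them, drop them, or fill them.
missing_per_column = df.isna().sum()
complete_rows = df.dropna()
filled = df.fillna({"temp_c": df["temp_c"].mean(), "rain_mm": 0.0})

print(missing_per_column)
print(filled)
```

Note that `drop`, `dropna` and `fillna` return new objects here; the original `df` is left unchanged.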

Matplotlib

This is a four-part Matplotlib tutorial.

1st part:

The first part introduces the basic functionality of Matplotlib and the basic figure types.

Matplotlib Part 1
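For orientation, here is a small sketch of four common figure types drawn side by side. Which figure types the notebook itself covers is an assumption on my part, and the data is randomly generated just to have something to plot.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)
rng = np.random.default_rng(0)      # seeded so the histogram is repeatable
samples = rng.normal(size=200)

# One figure with a 2x2 grid of axes, one figure type per panel.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("Line plot")

axes[0, 1].scatter(x, np.cos(x))
axes[0, 1].set_title("Scatter plot")

axes[1, 0].bar(["a", "b", "c"], [3, 7, 5])
axes[1, 0].set_title("Bar chart")

axes[1, 1].hist(samples, bins=20)
axes[1, 1].set_title("Histogram")

fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```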

2nd part:

Covers how to control the style and color of a figure, such as markers, line thickness, line patterns and using color maps.

Matplotlib Part 2
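The styling options mentioned above can be sketched as follows. This is an illustrative example rather than the notebook's code; the particular colors, markers and the "viridis" color map are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 30)
fig, ax = plt.subplots()

# Color, line pattern, line thickness and marker are set per line.
ax.plot(x, np.sin(x), color="tab:blue", linestyle="--",
        linewidth=2.5, marker="o", label="sin")
ax.plot(x, np.cos(x), color="crimson", linestyle=":",
        linewidth=1.0, marker="s", label="cos")

# A color map turns a numeric value (here x) into a color for each point.
points = ax.scatter(x, np.sin(x) * np.cos(x), c=x, cmap="viridis")
fig.colorbar(points, ax=ax, label="x value")

ax.legend()
# In a notebook the figure renders inline; in a script, call plt.show().
```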

3rd part:

Covers annotating a figure, compositing several figures, and controlling the axis range, aspect ratio and coordinate system.

Matplotlib Part 3
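A minimal sketch of three of these controls (annotation, axis range and aspect ratio) might look like this. The curve, label text and limits are made up for the example and are not taken from the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Annotate the maximum of the curve with a text label and an arrow.
ax.annotate(
    "maximum",
    xy=(np.pi / 2, 1.0),        # the point being annotated
    xytext=(3.0, 1.3),          # where the label text is placed
    arrowprops={"arrowstyle": "->"},
)

# Set the axis range explicitly instead of relying on autoscaling.
ax.set_xlim(0, 2 * np.pi)
ax.set_ylim(-1.5, 1.5)

# Equal aspect ratio: one unit on x is drawn as long as one unit on y.
ax.set_aspect("equal")

ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
# In a notebook the figure renders inline; in a script, call plt.show().
```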

4th part:

Covers working with complex figures.

Matplotlib Part 4

Conclusion

One of the easiest mistakes you can make when learning Python is attempting to learn too many libraries at the same time. When you try to learn everything at once, you spend too much time switching between different concepts, get frustrated, and move on to something else.

So stick to this process:

Understand Python basics

Learn NumPy

Learn Pandas

Learn Matplotlib

Links to Download:

You can download the files from my GitHub account. The files are in .ipynb format and also include the pictures I have used for illustration.

1) Python and Numpy Basics

2) Pandas Basics

3) Matplotlib

If you have any questions, feel free to ask in the comments.

Originally Published on Bigdata-madesimple

