# The datascience Package

The `datascience` package is an open source Python package that helps make programming more accessible to all students, regardless of background. As a pedagogical aid, the package is designed to let students carry out data science techniques more intuitively, without first spending considerable time learning more complex tools such as `pandas` or `matplotlib`. At Berkeley, these other packages are introduced in later upper-division coursework such as Data 100.

The `datascience` package was built with the main goal of teaching students to work with tables and visualizations in an introductory data science setting. It was inspired by techniques in SQL, `pandas`, and R data frames, and its design follows natural language more closely so that the syntax is more intuitive.

The package is built on built-in Python data structures, with several dependencies:

- `NumPy`: a tool for numerical computing and linear algebra. The `datascience` package relies on `numpy` arrays as its primary data structure; for example, each column in a `datascience` Table object is a `numpy` array. Many `numpy` functions, such as `np.mean` or `np.append`, are also introduced separately in the course.
- `SciPy`: a set of tools for scientific computing. The `minimize` function, used to minimize root mean squared error (RMSE), uses the `optimize` module from `scipy`.
- `Matplotlib`: a tool for visualization. Plotting is done directly in `datascience` by calling plotting functions on Table objects. Notably, plot tweaks such as renaming titles or adjusting axis shape are abstracted away from students.
- `pandas`: the more industry-standard tool for data manipulation and analysis. Although `pandas` is not a significant dependency, `datascience` supports conversion between its Table objects and `pandas` dataframes.

The `datascience` Python package was written by Berkeley professors John DeNero and David Culler, as well as students Sam Lau and Alvin Wan. The full documentation for the `datascience` package is available online, but students typically only need the Python Reference Guide, which covers the functions used widely in Data 8.

One large barrier to entry in data science for many students is the coding knowledge required. Because Data 8 was designed to be highly accessible to students of all backgrounds, the `datascience` package was created to make the programming part of the course more approachable for students with no coding background by removing syntax complexities. However, this decision comes with a significant trade-off: compared to industry-standard tools such as `pandas`, the package gives up computational flexibility and power in exchange for ease of understanding and use. This trade-off was acceptable for teaching Data 8, as datasets and their associated computation are typically not too large (<100 MB), and the computational flexibility required is limited to the scope of the course.

Overall, Data 8 emphasizes developing computational thinking skills over the details of specific syntax. This training allows students to transition more seamlessly to other, more complex packages after Data 8.

One limitation of the `datascience` package is that it does not support a wide range of data cleaning procedures. Data 8 abstracts away data cleaning methods, which are instead taught in Data 100. As such, students in Data 8 typically receive well-formed data without missing values. However, if you plan to place a larger focus on data cleaning or more advanced data manipulation in your course, `pandas` may be more appropriate.