Pandas (software)
Pandas (styled as pandas) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.[2] The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals,[3] as well as a play on the phrase "Python data analysis".[4]: 5 Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010.[5] The development of Pandas introduced into Python many comparable features of working with DataFrames that were established in the R programming language.[6] The library is built upon another library, NumPy.

History
Developer Wes McKinney started working on Pandas in 2008 while at AQR Capital Management out of the need for a high performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR, he was able to convince management to allow him to open source the library. Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library. In 2015, Pandas signed on as a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.[7]

Data Model
Pandas is built around data structures called Series and DataFrames. Data for these collections can be imported from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.[8]

A Series is a 1-dimensional data structure built on top of NumPy's array.[9]: 97 Unlike in NumPy, each data point has an associated label.
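The labelled Series described above can be sketched as follows. This is a minimal illustration rather than an example from the cited sources: the ticker symbols, prices, and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three prices, each labelled with a made-up ticker symbol.
prices = pd.Series([101.5, 250.0, 37.25], index=["AAA", "BBB", "CCC"])

# The values are backed by a NumPy array...
values = prices.to_numpy()

# ...but, unlike a bare array, each value can be looked up by its label.
bbb_price = prices["BBB"]

# Arithmetic between two Series matches values by label, not by position.
bonus = pd.Series([1.0, 2.0], index=["CCC", "AAA"])
adjusted = prices + bonus  # "BBB" has no partner in `bonus`, so it becomes NaN
```

Because the labels travel with the data, the order of the entries in `bonus` does not matter; only matching labels are combined.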
The collection of these labels is called an index.[4]: 112 Series can be used arithmetically: an operation on two Series aligns data points that share an index value before combining them. Users can transform or summarize data by applying arbitrary functions.[4]: 132 Since Pandas is built on top of NumPy, all NumPy functions work on Series and DataFrames as well.[9]: 115 Pandas also includes built-in operations for arithmetic, string manipulation, and summary statistics such as mean, median, and standard deviation.[4]: 139, 211 These built-in functions are designed to handle missing data, usually represented by the floating-point value NaN.[4]: 142–143

Subsets of data can be selected by column name, index, or Boolean expressions. Pandas also includes support for time series, such as the ability to interpolate values[4]: 316–317 and to filter using a range of timestamps.

Indices
By default, a Pandas index is a series of integers ascending from 0, similar to the indices of Python lists. However, indices can use any NumPy data type, including floating point, timestamps, or strings.[4]: 112

Pandas' syntax for mapping index values to relevant data is the same syntax Python uses to map dictionary keys to values: square brackets around an index value return the matching data point.

Pandas supports hierarchical indices with multiple values per data point. An index with this structure, called a "MultiIndex", allows a single DataFrame to represent multiple dimensions, similar to a pivot table in Microsoft Excel.[4]: 147–148 Each level of a MultiIndex can be given a unique name.[9]: 133 In practice, data with more than 2 dimensions is often represented using DataFrames with hierarchical indices, instead of the higher-dimension Panel and Panel4D data structures.[9]: 128

Criticisms
Pandas has been criticized for its inefficiency. Pandas can require 5 to 10 times as much memory as the size of the underlying data, and the entire dataset must be loaded in RAM. The library does not optimize query plans or support parallel computing across multiple cores.
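The in-memory footprint mentioned above can be inspected directly. The sketch below assumes a small, hypothetical DataFrame and uses `DataFrame.memory_usage`; it only measures memory and does not reproduce the 5 to 10 times figure, which depends on the data and the Pandas version.

```python
import pandas as pd

# Hypothetical DataFrame with one integer column and one string column.
df = pd.DataFrame({
    "id": range(1000),
    "city": ["Amsterdam", "Berlin", "Copenhagen", "Dublin"] * 250,
})

# Per-column byte counts (the index is included as its own entry).
# deep=True also inspects the contents of object columns, such as Python
# strings, instead of counting only the array buffers that hold them.
shallow_bytes = df.memory_usage(deep=False).sum()
deep_bytes = df.memory_usage(deep=True).sum()
```

Comparing `deep_bytes` with the size of the source file gives a rough sense of the overhead for a particular dataset.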
Wes McKinney, the creator of Pandas, has recommended Apache Arrow as an alternative to address these performance concerns and other limitations.[10]