This project is an implementaton of a dataframe akin to Python's Pandas or R's tibble/dplyr/tidyverse. R's influence is significantly stronger for this project, although I do borrow ideas from Pandas as well.
Main advantages are decent data organization; full support of NA-values; good extensibility.
Dataframes and vectors (series in Pandas) are immutable.
Supported types are: Integer, Float, Complex, String, Boolean, Time, Any and Vector (each element of which is a vector).
iris, err := dataframe.FromCSVFile("iris.csv")
To skip the first line use CSVOptionSkipFirstLine(true)
.
iris, err := dataframe.FromCSVFile("iris.csv", dataframe.CSVOptionSkipFirstLine(true))
If you need to pass options for the new dataframe, use CSVOptionDataframeOptions(options...)
.
db, err := sql.Open("sqlite3", "./test_data/items.sqlite")
if err != nil {
...
}
tx, err := db.Begin()
if err != nil {
...
}
df, err := FromSQL(tx, "SELECT * FROM sku", []any{})
If you need to pass options for the new dataframe, use SQLOptionDataframeOptions(options...)
.
Filtering is done with df.Filter(whicher)
. Two fundamental whichers are []int
with elements indices and
[]bool
. Filter()
filters out elements corresponding to false
. In most cases you do not need to pass
[]int
or []bool
directly as they are returned by many column functions.
Important: in Logarithmothechnia first index is 1! (This betrays R's roots of Logarithmotechnia).
Let's select elements with "setosa" in "species" vector (column).
filteredDf := iris.Filter(iris.C("species").Eq("setosa"))
Or those with sepal length greater than 5.
filteredDf := iris.Filter(iris.C("sepal_length").Gt(5))
Other comparing functions are: Neq()
, Lt()
, Gte()
, Lte()
.
It is possible to filter by using vector Which()
function.
What if you need to filter by two conditions at the same time? Here is two ways to do this:
filteredIris := iris.Filter(iris.C("species").Eq("setosa")).Filter(iris.C("sepal_length").Gte(5))
Or
filteredIris := iris.Filter(vector.And(
iris.C("species").Eq("setosa"),
iris.C("sepal_length").Gt(5),
))
The second approach is more general. What if you need to select all elements which are either of "setosa" species
or have sepal length more than 5? It is easy to do by changing vector.And
to vector.Or
.
filteredIris := iris.Filter(vector.Or(
iris.C("species").Eq("setosa"),
iris.C("sepal_length").Gt(5),
))
It is also possible to filter by passing a function to column's Which()
.
filteredIris = iris.Filter(iris.C("sepal_length").Which(
func(val float64) bool {
return val >= 5 && val < 7
},
))
Function has to have a signature supported by the vector (column) type.
Select rows from 10th to 20th (including).
subsetIris := iris.FromTo(10, 20)
To sort a dataframe use Arrange()
function. For example,
sortedBySepalLength := iris.Arrange("sepal_length")
In reverse:
sortedBySepalLengthReverse := iris.Arrange("sepal_length", dataframe.OptionArrangeReverse(true))
By two columns:
sortedBySepalLength := iris.Arrange("series", "sepal_length")
Mutate()
allows creates a new data frame with new columns, but preserving all columns of the old one.
For example, let's add a column which indicate a one of two buckets based on the sepal length.
bucketed := iris.Mutate(dataframe.Column{
"bucket",
iris.Cn("sepal_length").Apply(
func(val float64) int {
if val < 5 {
return 1
}
return 2
},
),
})
Here you can also an example of vector's Apply()
function which allows to generate a new vector from the other
one.
Select()
function allows selecting and dropping dataframe's columns.
Let's select species and sepal length from iris dataset.
compactIris := iris.Select("species", "sepal_length")
Or just drop petal length and petal_width.
compactIris := iris.Select("-petal_length", "-petal_width")
It is also possible to use column indices instead of names.
compactIris := iris.Select(5, 1)
Make "species" column appear before "sepal_length":
relocated := iris.Relocate("species", dataframe.OptionBeforeColumn("sepal_length"))
Or "petal_length" and "petal_width" after "species":
relocated := iris.Relocate("petal_length", "petal_width", dataframe.OptionAfterColumn("species"))
There are several types of joins available: InnerJoin()
, LeftJoin()
, RightJoin()
, FullJoin()
,
SemiJoin()
and AntiJoin()
. Last two are from dplyr. Here is an example of left join:
joined := employee.LeftJoin(department, OptionJoinBy("DepType"))
More examples of the joins can be found in tests.
Columns (and stand-alone vectors) can be converted to slices. For example:
data, na := iris.Cn("species").Strings()
If an element of na is true, it means a corresponding element of the column is NA-value.
Available converting functions are
Booleans() ([]bool, []bool)
Integers() ([]int, []bool)
Floats() ([]float64, []bool)
Complexes() ([]complex128, []bool)
Strings() ([]string, []bool)
Times() ([]time.Time, []bool)
Anies() ([]any, []bool)
There are also similar functions for converting a vector to other type:
AsInteger(options ...Option) Vector
AsFloat(options ...Option) Vector
AsComplex(options ...Option) Vector
AsBoolean(options ...Option) Vector
AsString(options ...Option) Vector
AsTime(options ...Option) Vector
AsAny(options ...Option) Vector
Another way is to use Apply()
function as shown before.
To rename a column, use a Rename()
function. There are several ways to pass which column to which value you
would like to rename (check function comment). For example:
renamedIris := iris.Rename([]string{"sepal_width", "s_width"})
Let's suggest you have a bucketed by "sepal_length" dataframe from the example above, and you want to find out max and min values for "petal_length" for every bucket. It is somewhat cumbersome for now as this is two-step operation. First, we group our bucketed dataframe by necessary columns:
grouped := bucketed.GroupBy("bucket")
Then we summarize it:
stats := grouped.Summarize(
grouped.C("petal_length").Min(),
grouped.C("petal_length").Max(),
)
fmt.Println(stats)
And we get the result:
# of columns: 3, # of rows: 2
petal_length_min: [(float)]1.200, 1.000]
petal_length_max: [(float)]6.900, 4.500]
bucket: [(integer)]2, 1]