@craig.emmerich
Apache Spark is a popular distributed computing framework used for big data processing, while MLlib is its machine learning library. Here are the basic steps for working with MLlib in PySpark:
- Import the necessary libraries:

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
```
- Create a SparkSession:

```python
conf = SparkConf().setAppName("MyApp")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
```

Equivalently, `spark = SparkSession.builder.appName("MyApp").getOrCreate()` creates the underlying SparkContext for you and is the more common pattern in newer code.
- Load your data into a DataFrame:

```python
# The pipeline below expects a string "text" column and a numeric "label" column,
# so let Spark infer column types instead of reading everything as strings.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("mydata.csv"))
```
- Prepare your data for machine learning using Transformers and Estimators:

```python
# Tokenizer and HashingTF are Transformers; LogisticRegression is an Estimator.
# Chaining them in a Pipeline lets you fit and apply all stages in one step.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
```
- Split your data into training and test sets:

```python
(trainingData, testData) = df.randomSplit([0.7, 0.3], seed=123)
```
- Fit the model on the training data:

```python
model = pipeline.fit(trainingData)
```
- Make predictions on the test data:

```python
predictions = model.transform(testData)
```
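If you want to eyeball the output, the fitted model adds `rawPrediction`, `probability`, and `prediction` columns; the `text` column name here just mirrors the input column assumed above:

```python
predictions.select("text", "probability", "prediction").show(5, truncate=False)
```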
- Evaluate the model:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Defaults: the metric is areaUnderROC, computed from the "rawPrediction" and "label" columns.
evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(predictions)
print("AUC = ", auc)
```
- Stop the SparkSession:
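```python
# Releases the underlying SparkContext and cluster resources.
spark.stop()
```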
These are the basic steps for working with MLlib in PySpark. Of course, there is much more you can do, such as cross-validation and hyperparameter tuning, but this should give you a good starting point.
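Since cross-validation and tuning came up, here is a minimal sketch of how they could plug into the pipeline above using `CrossValidator` and `ParamGridBuilder`; the grid values and fold count are placeholders, not recommendations.

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Hyperparameter grid to search (example values only).
paramGrid = (ParamGridBuilder()
             .addGrid(hashingTF.numFeatures, [1000, 10000])
             .addGrid(lr.regParam, [0.01, 0.1])
             .build())

# 3-fold cross-validation over the whole pipeline, scored by AUC.
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                     numFolds=3)

cvModel = cv.fit(trainingData)            # fits and keeps the best parameter combination
cvPredictions = cvModel.transform(testData)
```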