How to work with MLlib in PySpark?


by craig.emmerich, in category: Python, 8 months ago



1 answer


by jeremy_larkin, 8 months ago

@craig.emmerich 

Apache Spark is a popular distributed computing framework for big data processing, and MLlib is its machine learning library; the code below uses MLlib's DataFrame-based API from the pyspark.ml package. Here are the basic steps for working with MLlib in PySpark:

  1. Import the necessary libraries:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer


  2. Create a SparkSession:
spark = SparkSession.builder.appName("MyApp").getOrCreate()


  3. Load your data into a DataFrame:
df = spark.read.format("csv").option("header", "true").load("mydata.csv")
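
Note that, read this way, every CSV column comes back as a string, while LogisticRegression expects a numeric label column (named "label" by default). A minimal fix, assuming your file actually has such a column (the column name here is just an illustration):

from pyspark.sql.functions import col

# Cast the (assumed) "label" column to double so MLlib can train on it;
# alternatively, add .option("inferSchema", "true") when reading the CSV
df = df.withColumn("label", col("label").cast("double"))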


  4. Prepare your data for machine learning using Transformers and Estimators:
# Split the raw text into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# Hash the words into a fixed-size feature vector
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Chain the stages so they run as a single workflow
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])


  5. Split your data into training and test sets:
(trainingData, testData) = df.randomSplit([0.7, 0.3], seed=123)


  6. Fit the model on the training data:
model = pipeline.fit(trainingData)


  7. Make predictions on the test data:
predictions = model.transform(testData)
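
To sanity-check the output, show a few rows; the fitted model appends rawPrediction, probability, and prediction columns to the DataFrame:

# Inspect a handful of predictions next to the original label
predictions.select("label", "probability", "prediction").show(5)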


  8. Evaluate the model:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# The default metric is area under the ROC curve
evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(predictions)
print("AUC = ", auc)

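If you also want a headline number like accuracy, MulticlassClassificationEvaluator works for binary problems too; a small sketch:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# "accuracy" is one of the supported metricName values
accuracy = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(predictions)
print("Accuracy = ", accuracy)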

  9. Stop the SparkSession:
spark.stop()


These are the basic steps for working with MLlib in PySpark. There is of course much more you can do, such as cross-validation and hyperparameter tuning, but this should give you a good starting point.
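
As a minimal sketch of that last point, here is how cross-validation over a small parameter grid could look with the pipeline above (the grid values are only examples, and this should run before spark.stop()):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Try a few example values for the regularization strength and the
# size of the hashed feature vector
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.1])
             .addGrid(hashingTF.numFeatures, [1000, 10000])
             .build())

# 3-fold cross-validation, scored with the same AUC evaluator as above
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cvModel = cv.fit(trainingData)        # picks the best parameter combination
predictions = cvModel.transform(testData)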