Ranger User Sync Info – LDAP Settings

To import users from an LDAP server such as Windows AD or OpenDJ, follow the steps below:

Navigate to Ranger → Configs → Ranger User Info

Enable User Sync = Yes (turn it on)

In Common Configs

 

  • LDAP/AD URL = ldap://ldapserver:1389 (if you have secure LDAP, use an ldaps:// URL instead)
  • Bind User = cn=Directory Manager
  • Bind User Password = password (you can sanity-check the bind credentials with the sketch after this list)
  • Incremental Sync = Off (disable it for now)
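
Before turning the sync on, it is worth confirming that the bind DN and password actually work. A minimal sketch using the Python ldap3 package (the server URL and credentials are the placeholders from the list above):

from ldap3 import Server, Connection

# Placeholder values from the Common Configs above; replace with your own.
server = Server("ldap://ldapserver:1389")
conn = Connection(server, user="cn=Directory Manager", password="password")

if conn.bind():
    print("Bind succeeded: credentials are valid")
else:
    print("Bind failed:", conn.result)
conn.unbind()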

In User Configs

  • Username Attribute = uid or cn
  • User Object Class = person or inetOrgPerson
  • User Search Base = dc=company,dc=com
  • User Search Filter = *
  • User Search Scope = sub
  • User Group Name Attribute = memberof (AD) or ismemberof (OpenDJ)
  • Group User Map Sync = On (enabled; the user search settings can be verified with the sketch below)
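
To verify the user settings, you can run the equivalent LDAP search yourself. A sketch with ldap3, mirroring the search base, object class, and sub scope configured above (all values are the placeholders from this post):

from ldap3 import Server, Connection, SUBTREE

conn = Connection(Server("ldap://ldapserver:1389"),
                  user="cn=Directory Manager", password="password", auto_bind=True)

# Mirrors User Search Base, User Object Class and User Search Scope from above.
conn.search(search_base="dc=company,dc=com",
            search_filter="(objectClass=inetOrgPerson)",
            search_scope=SUBTREE,
            attributes=["uid", "cn", "ismemberof"])

for entry in conn.entries:
    print(entry.entry_dn)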

In Group Configs

  • Enable Group Sync = On (enable this only if you want to import groups, which is likely the case in most large organisations)
  • Group Member Attribute = uniqueMember
  • Group Name Attribute = cn
  • Group Object Class = groupOfUniqueNames
  • Group Search Base = dc=company,dc=com
  • Group Search Filter = cn=*
  • Enable Group Search First = Off
  • Sync Nested Groups = Off (enable it if your directory has a nested group hierarchy; the group settings can be verified with the sketch below)
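
The group settings can be sanity-checked the same way, by searching for the group object class and reading its member attribute (again a sketch with ldap3 and the placeholder values above):

from ldap3 import Server, Connection, SUBTREE

conn = Connection(Server("ldap://ldapserver:1389"),
                  user="cn=Directory Manager", password="password", auto_bind=True)

# Mirrors Group Search Base, Group Object Class and Group Search Filter from above.
conn.search(search_base="dc=company,dc=com",
            search_filter="(&(objectClass=groupOfUniqueNames)(cn=*))",
            search_scope=SUBTREE,
            attributes=["cn", "uniqueMember"])

for group in conn.entries:
    print(group.cn, "->", group.uniqueMember)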

HDFS LDAP Settings – Core Site XML

To configure HDFS to use LDAP for user authentication and authorization, edit the core-site.xml file in the Hadoop configuration directory (on an HDP installation this is typically /etc/hadoop/conf, which links into /usr/hdp).

In an Ambari-managed Hadoop installation, navigate to the following location:

HDFS → Configs → Advanced → Custom core-site, and add the following properties:

hadoop.security.group.mapping=org.apache.hadoop.security.LdapGroupsMapping
hadoop.security.group.mapping.ldap.url=ldap://ldapserver:389
hadoop.security.group.mapping.ldap.base=dc=company,dc=com
hadoop.security.group.mapping.ldap.bind.user=cn=Directory Manager
hadoop.security.group.mapping.ldap.bind.password=password
hadoop.security.group.mapping.ldap.search.filter.user=(objectClass=inetOrgPerson)
hadoop.security.group.mapping.ldap.search.filter.group=(objectclass=groupOfUniqueNames)
hadoop.security.group.mapping.ldap.search.attr.member=uniqueMember
hadoop.security.group.mapping.ldap.search.attr.group.name=cn

Note the parentheses in the ldap search filter values for user and group.

Replace ldapserver with the hostname or IP of your LDAP server, and set the bind password to match your bind user.
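
In core-site.xml itself, each key=value pair above becomes a <property> element. A sketch of the first two entries (the rest follow the same pattern):

<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldapserver:389</value>
</property>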

Save the configuration and restart the affected services.

Hadoop Security

Create a centralized user repository – Windows AD / OpenDJ

  • Create a domain or use an existing domain (e.g. company.com)
  • Create a new OU or use an existing OU (e.g. OU=People)
  • Add new groups to the OU (e.g. hadoopadmins, uidevelopers, datascientists); a sketch for scripting this follows the list
  • Add users to those groups
  • Note the following:
    • LDAP server IP : port
    • Base DN (e.g. dc=company,dc=com)
    • LDAP administrator (e.g. cn=Directory Manager)
    • LDAP password
    • Attributes for group name and user name (e.g. cn and sAMAccountName)
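
Adding the groups and users can also be scripted rather than done by hand. A minimal sketch with the Python ldap3 package (every DN and name below is a hypothetical example):

from ldap3 import Server, Connection

conn = Connection(Server("ldap://ldapserver:1389"),
                  user="cn=Directory Manager", password="password", auto_bind=True)

# Hypothetical user entry under the People OU.
conn.add("uid=alice,ou=People,dc=company,dc=com",
         object_class=["inetOrgPerson"],
         attributes={"cn": "Alice Smith", "sn": "Smith", "uid": "alice"})

# Hypothetical group listing that user as a uniqueMember.
conn.add("cn=hadoopadmins,ou=People,dc=company,dc=com",
         object_class=["groupOfUniqueNames"],
         attributes={"cn": "hadoopadmins",
                     "uniqueMember": "uid=alice,ou=People,dc=company,dc=com"})
print(conn.result)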

TIP: Use an LDAP browser or AD browser tool to make it easier to identify attributes and manage users.

Import or map the user repository with the following components.

3 Tech Guys who are amazing at what they do

The following three people are very good at what they do:

James Rapp

James Rapp is a Technical Specialist within the SAP Products & Innovation group. He has conducted training on openSAP and worked on many BI implementations for clients.

Sam Yen

Sam Yen is the Head of Commercial Real Estate Digital at JP Morgan Chase & Co (JPMC). Prior to JPMC, Sam was the Managing Director for SAP Silicon Valley and was SAP’s first Chief Design Officer, responsible for driving SAP’s Design and User Experience.

Ben Sullins

Ben Sullins is a true data geek with expertise in various tools and technologies related to Data Science and BI. He conducts training on LinkedIn Learning, available via Lynda.com.

https://bensullins.com/about/

I took online training from them on the following subjects:

BO 4.0 Implementation, Configuration and Administration, James Rapp

Design Principles in SAP UI5 and others, Sam Yen

Data Science courses, Ben Sullins

Elasticsearch Getting Started – Basics

  • Near Real Time
    • There is a slight latency (normally one second) from the time you index a document until the time it becomes searchable (see the sketch after this list).
  • Shards
    • When you create an index, you can simply define the number of shards that you want.
    • It allows you to horizontally split/scale your content volume
    • It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
  • Replicas
    • Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
    • It provides high availability in case a shard/node fails.
    • It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
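
The near-real-time behaviour described in the first bullet is easy to observe against a local dev cluster. A minimal sketch using Python's requests package (the index name and document are made up; recent Elasticsearch versions expose the _doc and _refresh endpoints used here):

import requests

ES = "http://localhost:9200"  # assumed local dev cluster

# Index a document; by default it only becomes searchable after the next refresh (~1 s).
requests.put(f"{ES}/demo/_doc/1", json={"msg": "hello"})

# Force a refresh so the search below sees the document immediately.
requests.post(f"{ES}/demo/_refresh")

hits = requests.get(f"{ES}/demo/_search", params={"q": "msg:hello"}).json()
print(hits["hits"]["total"])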

It is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.

The number of shards and replicas can be defined per index at the time the index is created.

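A sketch of setting both counts at index-creation time, again with Python requests against an assumed local cluster (the index name is made up):

import requests

ES = "http://localhost:9200"  # assumed local dev cluster

# Shard and replica counts go into the index settings at creation time.
resp = requests.put(f"{ES}/my_index", json={
    "settings": {"number_of_shards": 5, "number_of_replicas": 1}
})
print(resp.json())

# Verify the settings that were applied.
print(requests.get(f"{ES}/my_index/_settings").json())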

You can change the number of shards of an existing index using the _shrink and _split APIs.

By default (in Elasticsearch versions before 7.0), each index is allocated 5 primary shards and 1 replica, which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.

Each Elasticsearch shard is a Lucene index. The maximum number of documents you can have in a single Lucene index is 2,147,483,519.

More in the official Elasticsearch documentation.

Spark – The Beginner’s Guide

Apache Spark started at UC Berkeley (in the AMPLab).

Spark Components

  • Spark SQL
  • Spark Streaming
  • MLlib
  • GraphX
  • SparkR

Spark Core

  • Task distribution
  • Scheduling
  • Input/Output

Spark SQL

  • Uses DataFrames (see the sketch below)
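
A minimal PySpark sketch of that DataFrame/SQL interplay (the data, column names, and view name are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Build a tiny DataFrame and expose it to SQL as a temporary view.
df = spark.createDataFrame([("alice", 1), ("bob", 2)], ["name", "id"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE id > 1").show()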

Spark Streaming

  • Uses micro-batches (see the sketch after this list)
  • Suits the Lambda architecture – take old historic (batch) data and then add the changes/delta
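
A short Structured Streaming sketch of the micro-batch model, using the built-in rate source (the trigger interval and runtime are arbitrary; the classic DStream API is micro-batch based as well):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The built-in "rate" source emits rows continuously; Spark consumes them in micro-batches.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (stream.writeStream.format("console")
               .trigger(processingTime="2 seconds")  # one micro-batch every two seconds
               .start())
query.awaitTermination(10)  # let it run for ~10 seconds
query.stop()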

MLlib

  • Uses benchmark tests

GraphX

  • In-memory version of Apache Giraph
  • Based on RDDs

SparkR

  • R package for Spark
  • Distributed DataFrame
  • R Studio Integration

 

Spark Use Cases

  • Data Integration
  • Machine Learning
  • BI/Analytics
  • Real-Time Processing
  • Recommendation Engines

 

Languages used in Spark

  • Scala
  • SQL
  • Python
  • R
  • Java

 

Databricks is a cloud-based managed platform for running Apache Spark.

Databricks was founded by the team at UC Berkeley that created Apache Spark.