[PeopleMatchAI](https://github.com/softsky/people-match-ai) is designed for fast determination of purchasing capabilities from a person's profile.
A person's profile looks like this:
{
  "_id": {
    "$oid": "58ea248524dce09f41209710"
  },
  "searchResult": [
    {
      "otherEmails": [],
      "country": "United States",
      "source": "PeopleData",
      "lastUpdated": "2016-09-26T16:54:28Z",
      "id": "AVdO_GcbPKMaPmPU7jEY",
      "state": "arizona",
      "probability": 1,
      "query": "MANUAL_REQUEST",
      "firstName": "John",
      "phone": "1-520-247-9050",
      "lastName": "Soukup",
      "stateAbbr": "AZ",
      "activity": null,
      "gender": "M",
      "city": "Tucson",
      "result": {
        "resultStatus": "OK"
      },
      "term": "[1]",
      "mergedIdentities": [
        {
          "source": "EX",
          "datetime": "2016-06-23T00:00:00Z",
          "lname": "Soukup",
          "class": "com.selerityfinancial.person.peopledata.dto.PeopleDataPerson",
          "email": "c.l.soukup@comcast.net",
          "fname": "John",
          "address": {
            "zip": "85749",
            "city": "Tucson",
            "streetAddress": "9352 E Vallarta Trl",
            "state": "AZ",
            "class": "com.selerityfinancial.person.peopledata.dto.PeopleDataAddress"
          },
          "ip": [
            "71.226.126.234"
          ],
          "phone": [
            "1-520-247-9050"
          ]
        }
      ]
    }
  ]
}
- The `mergedIdentities` section contains an array of profile data collected from different sources (see the cleanup sketch after this list)
- it may contain duplicates
- it may contain empty records
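A minimal cleanup sketch, assuming profiles are plain Python dicts shaped like the example above; the helper name and the duplicate key (email + phone) are hypothetical choices, not part of the project:

```python
def clean_merged_identities(profile):
    """Drop empty and duplicate entries from each mergedIdentities list in a profile."""
    for result in profile.get("searchResult", []):
        seen = set()
        cleaned = []
        for identity in result.get("mergedIdentities", []):
            if not identity:  # skip empty records
                continue
            # email + phone as a rough duplicate key (names alone are not reliable)
            key = (identity.get("email"), tuple(identity.get("phone", [])))
            if key in seen:   # skip duplicates
                continue
            seen.add(key)
            cleaned.append(identity)
        result["mergedIdentities"] = cleaned
    return profile
```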
The system supports over 240 million profiles of US citizens.
- Scalability: the system architecture should be scalable and highly performant (in terms of both training and evaluation).
- High Performance: since there will be multiple concurrent searches, and every search takes a considerable amount of time, previous results should be cached. The cache is wiped every time the main database of people profiles is updated. We should monitor the most frequent searches and cache results only for them; searches performed only once or a few times are not cached, to save memory and disk space (see the cache sketch after this list).
- Distribution: a model, once trained for some search purpose, can easily be distributed between evaluation nodes. I'd suggest using Docker containers (some for training, others for real-time evaluation).
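A minimal sketch of the frequency-based cache described above, assuming search results are keyed by the query string; the class name and `min_hits` threshold are hypothetical:

```python
import collections

class FrequentSearchCache:
    """Keeps results only for searches seen at least `min_hits` times."""

    def __init__(self, min_hits=3):
        self.min_hits = min_hits
        self.hit_counts = collections.Counter()
        self.results = {}

    def get(self, query):
        self.hit_counts[query] += 1
        return self.results.get(query)  # None means: run the real search

    def put(self, query, result):
        # cache only frequent searches to save memory and disk space
        if self.hit_counts[query] >= self.min_hits:
            self.results[query] = result

    def wipe(self):
        # called whenever the main profile database is updated
        self.results.clear()
        self.hit_counts.clear()
```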
package "Training" {
node "Training Box" as TB {
[Model To Train]
[Special Purpose Algorithm]
[GPU Rack]
}
}
cloud "Evaluation" {
node "Evaluation Box #1" as EB1{
[Model] as EB1_M
[Special Purpose Algorithm] as EB1_SPA
[GPU Rack] as EB1_GPU
}
node "Evaluation Box #2" as EB2{
[Model] as EB2_M
[Special Purpose Algorithm] as EB2_SPA
[GPU Rack] as EB2_GPU
}
node "Evaluation Box #3" as EB3{
[Model] as EB3_M
[Special Purpose Algorithm] as EB3_SPA
[GPU Rack] as EB3_GPU
}
node "Evaluation Box #N" as EBN {
[Model] as EB4_M
[Special Purpose Algorithm] as EB4_SPA
[GPU Rack] as EB4_GPU
}
}
TB --> EB1
TB --> EB2
TB --> EB3
TB --> EBN
Basically, the system architecture will look like the diagram above.
The Training Box performs the following operations:
- New Profiles Crawling
- Identity Match and Profile Merging
- Profile Enhancement
- Model training
- Model distribution across Evaluation Nodes
The Training Box uses Google TensorFlow as its AI engine.
Old and new profiles form the `Train Corpus`, which is used by TensorFlow to create new `Models`.
package "Master" {
node "Spark" {
[CSVImporter] - Importer:implements
[JSONImporter] - Importer:implements
[DBResultExporter] - Exporter:implements
}
node "Hadoop" {
[out]"HDFS folder" ..> [BatchPredictionResult.csv]:contains
[in]"HDFS folder" ..> [InputData.csv]:contains
}
Spark -> Hadoop:use
}
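A rough PySpark sketch of the importer/exporter roles shown above; the HDFS paths and application name are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PeopleMatchAI-batch").getOrCreate()

# CSVImporter: read input data from the "in" HDFS folder
input_df = spark.read.csv("hdfs:///peoplematch/in/InputData.csv", header=True)
# JSONImporter would use spark.read.json(...) for JSON inputs

# ... batch prediction would run here, producing predictions_df ...
predictions_df = input_df  # placeholder: pass-through until the model step is wired in

# DBResultExporter: write results to the "out" HDFS folder
predictions_df.write.mode("overwrite").csv(
    "hdfs:///peoplematch/out/BatchPredictionResult.csv", header=True)
```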
component [Train Corpus] as Corpus
node {
component [TensorFlow] as TF
component [Model To Train] as Model
}
Corpus --> TF
TF --> Model
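A minimal TensorFlow 1.x training sketch, assuming the Train Corpus has already been vectorized into numeric feature rows and labels; the feature count, network shape, and checkpoint path are all hypothetical stand-ins for a special-purpose model:

```python
import tensorflow as tf  # TensorFlow 1.x

NUM_FEATURES = 32  # hypothetical number of numeric features per profile

features = tf.placeholder(tf.float32, [None, NUM_FEATURES], name="features")
labels = tf.placeholder(tf.float32, [None, 1])

# small feed-forward network standing in for a special-purpose model
hidden = tf.layers.dense(features, 64, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 1)
# named so evaluation nodes can look the output up after restoring the checkpoint
probability = tf.nn.sigmoid(logits, name="probability")

loss = tf.losses.sigmoid_cross_entropy(labels, logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # batch_x, batch_y would be drawn from the Train Corpus, e.g.:
    # for batch_x, batch_y in corpus_batches():
    #     sess.run(train_op, {features: batch_x, labels: batch_y})
    saver.save(sess, "/models/purchase_capability/model.ckpt")
```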
The operational sequence is shown here:
Controller --> Crawler: Crawl Profiles
Controller <-- Crawler: crawled Profiles
Controller --> IdentityMatcher: Match identities and merge profiles
Controller <-- IdentityMatcher: merged Profiles
Controller --> Enhancer: Enhance Profile
Controller <-- Enhancer: enhanced Profiles
Controller --> DB: Update database with enhanced Profiles
Controller --> Train: Re-train model with new profile corpus
Controller --> Network: Distribute updated Model across Evaluation nodes
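A sketch of how a controller might orchestrate this pipeline; every name below is hypothetical and simply mirrors the participants in the sequence above:

```python
def refresh_profiles(crawler, matcher, enhancer, db, trainer, network):
    """One pass of the Training Box pipeline from the sequence above."""
    crawled = crawler.crawl_profiles()
    merged = matcher.match_and_merge(crawled)
    enhanced = [enhancer.enhance(p) for p in merged]
    db.update(enhanced)                          # update database with enhanced profiles
    model = trainer.retrain(db.all_profiles())   # re-train on the new profile corpus
    network.distribute(model)                    # push the model to the evaluation nodes
```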
Crawling is performed over multiple resources, so we need a way to properly match identities and merge their profiles. We might use email or phone as a unique identifier, since a name won't always work. Since some resources might not return a unique identifier, we use AI to compare fields.
Controller --> IdentityMatcher: Sends unmatched Profiles for similarity check
IdentityMatcher --> TensorFlow: performs field analysis to determine similarity
IdentityMatcher <-- TensorFlow: sends back result for each pair of Profiles
IdentityMatcher --> ProfileMerger: Sends profile pairs to be merged
ProfileMerger --> DB: updates database with merged profiles
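A simplified sketch of the matching rule described above, assuming profiles carry `email`, `phone`, and name/address fields; `similarity_model` and the threshold are placeholders for the TensorFlow field-similarity check in the sequence:

```python
def same_identity(a, b, similarity_model, threshold=0.9):
    """Decide whether two profile records refer to the same person."""
    # an exact match on a unique identifier wins outright
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    if a.get("phone") and a.get("phone") == b.get("phone"):
        return True
    # otherwise fall back to the learned field-similarity model
    fields = ["fname", "lname", "city", "state", "zip"]
    pairs = [(a.get(f, ""), b.get(f, "")) for f in fields]
    return similarity_model.score(pairs) >= threshold
```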
The Training Box will primarily be used to train all special-purpose models, probably using a slightly modified Inception v3 algorithm. Training from scratch is a time-consuming operation; however, once all special-purpose algorithms and models are trained, the box can be shut down to save hosting costs and brought up again only when the next algorithm/model needs training. We will apparently have several purposes (and therefore several models and algorithms), depending on the type of information consumers need to receive as the result of their searches.
Evaluation boxes, on the other hand, will run all the time; they will serve searches over large datasets, returning the data appropriate to each consumer's query.
[Profiles] as Profiles
[Model] as Model
node {
component [TensorFlow] as TF
[Controller] as Controller
}
component [Evaluation Result] as Result
Profiles --> Controller
Model --> Controller
Controller --> TF
TF --> Controller
Controller --> Result
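A minimal evaluation-side sketch, restoring the (hypothetical) checkpoint produced by the training sketch above and scoring vectorized candidate profiles:

```python
import tensorflow as tf  # TensorFlow 1.x

profile_vectors = [[0.0] * 32]  # placeholder feature rows built from candidate profiles

with tf.Session() as sess:
    # restore the graph and weights distributed from the Training Box
    saver = tf.train.import_meta_graph("/models/purchase_capability/model.ckpt.meta")
    saver.restore(sess, "/models/purchase_capability/model.ckpt")
    graph = tf.get_default_graph()
    features = graph.get_tensor_by_name("features:0")
    probability = graph.get_tensor_by_name("probability:0")

    # score each candidate profile against the consumer's search purpose
    scores = sess.run(probability, feed_dict={features: profile_vectors})
```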
Here are the system hardware requirements:
- Training box: distributed among 5 high-performance 128 GB / 10 TB machines with 6*12 Nvidia GPUs
- Evaluation boxes: the count depends on the number of concurrent searches and on the overall database size and complexity (in terms of fields in person profiles)
Here are the system software requirements:
- OS: Amazon/Ubuntu Linux with recent 4.x kernel
- DB: MongoDB 3.3+
- AI: Google TensorFlow 1.x
- JVM: v1.8 or higher
From the project directory, run:
docker-compose up
docker exec -ti peoplematchai_master_1 bash
spb run