skip to main | skip to sidebar

Python Programs and Examples

Pages

  • Home
 
  • RSS
  • Facebook
  • Twitter
Friday, October 19, 2012

Python script to Validate Tabular format Data and database restore

Posted by Raju Gupta at 4:20 AM – 0 comments
 

Our python utility does all these functionality.

Major functionality:

  • Process a batch of large cvs/text files 
  • Can iterate in sub-folders 
  • File can be filtered on criteria expressed in regular expression format 
  • Reading cvs/text files 
  • Performing sanity and basic column based validation checks like null values, duplicate values, max, min, unique values etc. 
  • It also infer data type for all columns 
  • SQL Server connection 
  • On the fly table creation and bulk insertion for any given text/cvs file 
  • Logging of all activities using a separate logger module 
  • Highly configured utility 

Main.py:

import Utility
import Logger
import FieldTypes
import SQLServerConnection

import datetime
import os
import re
import string
import pyodbc


#start time
Logger.LogMessage(str(datetime.datetime.now()))
dbConn = SQLServerConnection();
temp = Utility()

#read source directory
reader = open("srcdir.txt","rb")
srcdir = reader.readline()

iterate source directory
temp.iterateFolder(srcdir)
temp.reportProfile()
print "profiling done"

reader.close();

#end time
Logger.LogMessage(str(datetime.datetime.now()))
print datetime.datetime.now()


Utility.py

import SQLServerConnection
import Field
import csv
import os
import re
import string
from Logger import Logger
import datetime



class Utility:
    fileName = ""
    
   
    def __init__(self):
        #initialise class variable
        
        #map for fieldName -> fieldInformation
        self.data = dict()
        self.profileOutput = open('C:\\temp\\profile.csv','w')
        self.output = ['FileName','FieldName','InferredDataType','Len','TotalRows','NullValues','UniqueValues']
        self.profileOutput.write(','.join(self.output))
        self.profileOutput.write('\n')
        self.profileList = []
        
        
    def readcvsFile(self, fileName, onlyName):
        ROWSEPARATER = "\n"
        FIELDSEPARATER = "\t"
        self.data = dict()
        
        #read as a text file
        reader = open(fileName,"rb")
        headers = reader.readline().split(FIELDSEPARATER)

        #header row
        Column = len(headers)
       
        #initialised the map
        for column in headers:
            self.data[column]=Field.Field(column.rstrip())
        Logger.LogMessage("No of column:        " + str(Column))
        
        #process every row
        RowCount = 0
        ErrorRow = 0
        for rowstring in reader:
            RowCount = RowCount + 1
            rowCol = rowstring.count(FIELDSEPARATER) + 1
            row = rowstring.split(FIELDSEPARATER)
            #need to check value of last col
            #if it has "," it means there are more delimiter in text file
            if rowCol == Column:
                i =0
                for column in row:
                    (self.data[headers[i]]).addValue(column)
                    i = i + 1
            else:
                Logger.LogMessage("Row number " + str(RowCount) + " is not valid. It has " + str(rowCol) + " columns.")
                ErrorRow = ErrorRow + 1
                   
        Logger.LogMessage("No of Rows processed(Other than header file):"+str(RowCount))
        Logger.LogMessage("Total Errors:                                "+str(ErrorRow))
        reader.close();
        
        
        name, extention = onlyName.split(".")
        createFile = "create table " +  name + "("
        fieldCount = 1
        
        for k,v in self.data.iteritems():
            #mark end of processing and take profiled output for each field
            v.endOfLoading()
            self.profileList.append([fileName, v.FieldName, v.FieldString, str(v.FieldSize), str(v.RecordSize), str(v.NullValues), str(v.UniqueValues)])
            if fieldCount > 1:
                createFile = createFile + ", "
            createFile = createFile + v.FieldName + " varchar(1000) "
            
            fieldCount = fieldCount + 1
            v.clear()

        createFile = createFile + ")"
        Logger.LogMessage("Sql for file creation:" + createFile)
        SQLServerConnection.cursor.execute(createFile)
        SQLServerConnection.conn.commit()
        
        bulkcopySQL = "BULK INSERT " + name + " FROM '" + fileName + "' WITH ( FIELDTERMINATOR='\t',FIRSTROW=2,ROWTERMINATOR='" + chr(10) + "')"
        Logger.LogMessage("running bluk copy:" + bulkcopySQL)
        SQLServerConnection.cursor.execute(bulkcopySQL)
        SQLServerConnection.conn.commit()
        
        del self.data
        
        

    def iterateFolder(self,dir):
        fileexp = re.compile(r'\w*\.csv')
        
        #iterate for the directory
        for f in os.listdir(dir):
            
            if os.path.isfile(os.path.join(dir,f)) and fileexp.match(f) is not None:
                Logger.LogMessage("***************************************")
                Logger.LogMessage("FileName:             " + f)
                Logger.LogMessage("Directory Name:      " + dir)
                self.readcvsFile(os.path.join(dir,f), f)
                Logger.LogMessage("*************************************")
                Logger.LogMessage(str(datetime.datetime.now()))
                Logger.LogMessage(" ")
                print "Processing " , f
               
            
                
                Logger.flush()
            elif os.path.isdir(os.path.join(os.getcwd(),f)): 
                print os.path.join(dir,f)
                self.iterateFolder(os.path.join(dir,f))
          
                
    
    def reportProfile(self):
        
        for detail in self.profileList:
            self.profileOutput.write(','.join(detail))
            self.profileOutput.write('\n')
            


Field.py:

from types import *
import re
import datetime
import Logger

class FieldTypes:
    Null,Integer, Float, Varchar, Date, DateTime = range(6)
    intexp = re.compile('[0-9]+$')
    floatexp = re.compile('[0-9]*\.[0-9]+$')
    dateexp = re.compile(r'[0-9]{1,4}([.,/-])[0-9]{1,2}\1[0-9]{1,4}$')
    
    
     #this functions returns corresponding string for enum type 
    def getEnumString(self, type):
        if type == self.Integer: 
            return "Integer"
        if type is self.Float:
            return "numeric"
        if type is self.Varchar:
            return "varchar"
        if type is self.Date:
            return "date"
        if type is self.Null:
            return "null"
    

 #this function infers datatype using combination of new value and calculated datatype so far    
    def fieldType(self,value, previousType):

        if len(value) is 0:
            return previousType
        
        if previousType is self.Float:
            if self.floatexp.match(value) is not None:
                return self.Float
            
        if previousType is self.Date:
            if self.dateexp.match(value) is not None:
                return self.Date
            
        if previousType is self.Null:
            if self.intexp.match(value) is not None: 
                return self.Integer
            if self.floatexp.match(value) is not None:
                return self.Float
            if self.dateexp.match(value) is not None:
                return self.Date
            
        if previousType is self.Integer:
            if self.intexp.match(value) is not None: 
                return self.Integer
            if self.floatexp.match(value) is not None:
                return self.Float
            if self.dateexp.match(value) is not None:
                return self.Date

        return self.Varchar
            
          
    def getFieldType(self,valueList):
        Type = self.Null
        for value in valueList:
            if Type is self.Varchar:
                break
            Type = self.fieldType(value, Type)
        del valueList
        return Type
            

#Field class represent a complete field domain in the table. It has all properties
#related to a field like name, datatype, size, min value etc. It has contains all 
#list of values for that particular field.
class Field:
   
    def __init__(self, name, storeValue=0):
        self.FieldName = name
        self.FieldSize = 0
        self.FieldType = FieldTypes.Null
        self.FieldValues = set()
        self.RecordSize = 0
        self.UniqueValues = 0
        self.FieldString = ""
        self.NullValues = 0
        self.storeVal = storeValue
       
    def addValue(se1lf,val):
        
        self.RecordSize = self.RecordSize +1
        #print self.FieldValues
        if len(val) is 0:
            self.NullValues = self.NullValues + 1
        else:
            if self.storeVal is not 0:
                self.FieldValues.add((val.rstrip()))
            if len(val) > self.FieldSize:
                self.FieldSize = len(val)
        
    def clear(self):
        del self.FieldValues

    
    def endOfLoading(self):
        if self.storeVal is not 0:
            self.FieldType = FieldTypes().getFieldType(self.FieldValues)
            self.UniqueValues = len(self.FieldValues)
            self.FieldString = FieldTypes().getEnumString(self.FieldType)
        
    def printSummary(self):
        Logger.LogMessage(self.FieldName + " " +  self.FieldString+ " " +str(self.FieldSize)+ str(self.RecordSize)+ str(self.NullValues)+ str(self.UniqueValues)) 
        
    def printValues(self):
        for value in self.FieldValues:
            print value
            


Labels: Python Date Example Email This BlogThis! Share to X Share to Facebook

Leave a Reply

Newer Post Older Post
Subscribe to: Post Comments (Atom)
  • Popular
  • Recent
  • Archives

Popular Posts

  • To Send the entire contents of directory as an email Message.
    Here is a Python Program to send the entire contents of a directory as an email message #!/usr/bin/env python """Send the...
  • Control Structures in Python
    This Program explains different control structures in python If ==== # basic "If" if boolean: x = 1 else: x = 0 # No...
  • Python Code for creating Screen saver
    This is the Python code for creating a screensaver of a moving ball. The ball will bee moving along the screen and when it hits any of the ...
  • Python script for walking the directory tree structure but excluding some directories and files
    The script walks the directory structure recursively. Some of the directories and files can be excluded from the walk by providing them in ...
  • XML to CSV Converter
    Python Program to extract each execution data (few children in xml) and create a csv file. def xml2csv(inputFile,outputFile,elemName,n...
  • HTML Email using Python
    We created a module called sendmail which email html content. Each of application dumps data from the datastructure into html format and th...
  • Overview of Python Programming Language
    Python is an easy to learn, powerful programming language. It has efficient high level data structures and a simple but effective approac...
  • Fast and efficient Backup script that works on Windows and Linux
    This python script backup your directory structure that works on Linux and Windows both. It directly uses the shell to do this task and she...
  • Run Application from host
    import os os.execl( "c:/Windows/Notepad.exe", "c:/userlog.txt") print "Running notepad" #or import subpro...
  • Orphan Checker
    The script does not use any outside libraries, thus keeping its execution simple. we will start by initializing the arrays that we'll...
Powered by Blogger.

Archives

  • ▼  2012 (66)
    • ▼  October (28)
      • Find text/pattern under parent folder from matched...
      • XML to CSV Converter
      • XSL to CSV converter
      • Spell Corrector in Python
      • Format a plain ASCII file using Python script
      • Control Structures in Python
      • Python script to zip and unzip files
      • Fast and efficient Backup script that works on Win...
      • File Splitter Script
      • Python script for walking the directory tree struc...
      • Python XML Parser
      • IP Address Generation
      • Orphan Checker
      • HTML Email using Python
      • Python interface to load commands.
      • Python script to Validate Tabular format Data and ...
      • Python script to Load Configuration file content i...
      • Python Script to search file for specific content
      • Network Wise File Concatenation
      • Python Script for Adding or Subtracting the Dates
      • Program to create make file in VMS/UNIX
      • Python Code for creating Screen saver
      • Run Application from host
      • Multithreading in Python
      • This utility determines the size of a folder and i...
      • Using UNIX Environment Variables From Python
      • CVS Macro for Commited Files
      • Generating unique id's
    • ►  September (36)
    • ►  March (1)
    • ►  February (1)
 

Labels

  • Command Execution Using Python (2)
  • Control Structure (1)
  • CSV Examples (4)
  • Database (3)
  • Features of Python Programming Language (1)
  • File Example (2)
  • IF Statement (1)
  • List (2)
  • Mail (5)
  • Multithreading (1)
  • MySQL (3)
  • Python Date Example (2)
  • Python Example (2)
  • Python File Example (7)
  • Python Introduction (1)
  • Python Network Example (1)
  • Python XML Parser (2)
  • Searching Example (2)
  • Simple Calculation Program (1)
  • Strings (4)
  • Zip File Example (1)

Followers

 
 
© 2011 Python Programs and Examples | Designs by Web2feel & Fab Themes

Bloggerized by DheTemplate.com - Main Blogger

  • Tweet
  • t