Ju PENG
Joined: 22 Apr 2013 Posts: 1
Posted: Mon Apr 22, 2013 4:07 am Post subject: Any ideas or suggestions for my data mining project?
I work in the field of classification and data mining, and here is my problem:
Context:
There are about 2 million invoices that need to be classified. All of these invoices are images, but we have already extracted the data from the images and exported it to XML, so we have 2 million XML files. Each invoice has a provider. Most invoices carry a numeric key that identifies the provider, but others don't. There is also other useful information, for example the email address and the website (not every invoice has them). Each provider uses its own invoice template, so the structure differs from one provider to another. My job is to classify all these invoices by provider.
At first, I used the provider number to classify the documents, and that worked on 80% of the files. Then I used the website or email address that had been associated with a provider number, and that solved 10% more. But I have no idea what to do next, because the data was extracted from the image files by OCR, so some words are garbled (unclear images, handwriting).
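One possible way to recover some of the remaining files is to tolerate OCR noise in the provider number instead of requiring an exact match. The following is only a minimal sketch, not something already tried on this data: it assumes the provider numbers are digit strings, the look-alike character table and the sample keys are made up for illustration, and the `cutoff` value would need tuning on real invoices. It uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Common OCR confusions between letters and digits (illustrative, not exhaustive).
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def normalize_key(raw):
    """Drop whitespace and map look-alike letters to digits."""
    return "".join(raw.split()).translate(OCR_DIGIT_FIXES)


def match_provider_key(ocr_key, known_keys, cutoff=0.85):
    """Return the known key closest to the OCR-read key, or None if none is close enough."""
    candidate = normalize_key(ocr_key)
    matches = difflib.get_close_matches(candidate, known_keys, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical provider numbers, standing in for the keys already known
# from the 80% of files that were classified by number.
known_keys = ["40303265045", "12345678901", "81190798000"]

print(match_provider_key("4O3O 3265O45", known_keys))  # letters read instead of digits
print(match_provider_key("4030326545", known_keys))    # one digit lost by the OCR
print(match_provider_key("77777777777", known_keys))   # nothing similar: None
```

The same idea could be applied to email addresses and websites. A high cutoff keeps false matches rare, at the price of leaving more files unclassified.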
Now I think it would be better to classify the files by their structure, but I haven't been able to work out how. Do you have any good ideas? Thank you!
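To make the structure idea concrete, here is a rough sketch of one way it might be done. It assumes that the exported XML element layout really does vary with the provider's invoice template, which would have to be checked on the real files. Each file is reduced to the set of its element paths, and an unclassified file gets the provider whose reference set is most similar (Jaccard similarity). The tag names, provider names and the `threshold` value are all hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of element paths in the document, e.g. 'invoice/header/date'."""
    paths = set()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}" if prefix else elem.tag
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify_by_structure(xml_text, references, threshold=0.6):
    """Return (provider, score) for the most similar reference, or (None, score) if too weak."""
    paths = tag_paths(xml_text)
    best_provider, best_score = None, 0.0
    for provider, ref_paths in references.items():
        score = jaccard(paths, ref_paths)
        if score > best_score:
            best_provider, best_score = provider, score
    if best_score < threshold:
        return None, best_score
    return best_provider, best_score


# Reference fingerprints built from invoices already classified by provider number.
references = {
    "acme": tag_paths(
        "<invoice><header><date/><total/></header><lines><line/></lines></invoice>"
    ),
    "globex": tag_paths(
        "<invoice><supplier><vat/></supplier><table><row/></table><footer/></invoice>"
    ),
}

unknown = "<invoice><header><date/><total/></header><lines><line/><line/></lines></invoice>"
print(classify_by_structure(unknown, references))  # ('acme', 1.0)
```

In practice the reference set for a provider could be built from many of its already classified invoices rather than a single one, and files that score below the threshold would be left for manual review.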