Author Topic: Capture NGC specimens  (Read 4180 times)

Offline dc55232

  • Junior Member 50+ posts
  • **
  • Posts: 67
Capture NGC specimens
« on: 2013 Mar 03, 09:26:48 AM »
Dear badon,

I am currently running a project to automatically capture as many NGC specimens as possible. I spent last weekend writing some code, exported CC data as a seed, and ran the code during last weekend and this one. So far I have captured about 130,000 specimens with NGC label data (number, description, denomination and grade) and, if applicable, links to the images. About 105,000 of them have images. I am doing this to support current and upcoming research projects.
For example, so far I have captured 261 specimens with images of the 1989 3.3 oz silver God of War and Wealth.

I have no idea how to enter all that data into the CC but I am working on a semi-automatic type assignment ...

Regards
dc55232
 

Offline badon

Re: Capture NGC specimens
« Reply #1 on: 2013 Mar 03, 06:25:00 PM »
The CC team can help sort the types by varieties, if it requires checking the images. You don't need to do that task. Have you used the advanced specimen entry form features? SANDAC has much experience with it. Just click the normal link to add a specimen, and then click on the various areas that indicate advanced features. Not all of them display a clickable link mouse cursor, but click them anyway. As long as all the specimens you're entering are the same type, grade, and certification company, they can be entered very quickly. The CC team can do that for you too, once you have sorted lists of certification numbers, but it might be easy enough for you to at least try it out yourself to see how it works.

NGC is grading specimens at a rate of about 1 every 10 seconds during an 8 hour work day. The only way we can capture all those is by combining automated tools with human efforts. I wasn't aware you're a programmer, dc55232. There may be ways to integrate these tools you're making more closely into the CC, such as displaying an image and offering a choice to make it simple to assign variety types. What languages do you use?
 

Offline badon

Re: Capture NGC specimens
« Reply #2 on: 2013 Mar 04, 01:45:27 AM »
Have you used Jitbit Macro Recorder before? It is very useful for quick tasks that I don't want to do much programming for. It allows you to essentially create a bot to do repetitive tasks for you, very quickly. One thing I use it for is copying and pasting data into different form fields, and then submitting the form. The trick to getting it to work reliably is to have it detect images that only load when a web page is fully loaded and ready to go. This is the magic trick:

REPEAT : 999
IF IMAGE :
EXIT LOOP
ENDIF
DELAY : 500
ENDREPEAT

That checks for some part of the screen to be loaded every 500 milliseconds. When it finds it, the loop is exited and the rest of the task can proceed.
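
The same wait-until-ready pattern can be sketched outside Jitbit, for example in Python. This is only an illustration, not Jitbit syntax: `wait_until_ready` and the `is_ready` callback are hypothetical names, and the callback stands in for whatever check (an image match, an element lookup) tells you the page has finished loading.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    max_attempts: int = 999,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or the attempts run out.

    Mirrors the macro above: check, exit the loop on success,
    otherwise wait 500 ms and try again, up to 999 times.
    """
    for _ in range(max_attempts):
        if is_ready():
            return True  # equivalent of EXIT LOOP
        sleep(delay_seconds)  # equivalent of DELAY : 500
    return False  # gave up; the caller decides what to do next
```

Returning `False` instead of hanging forever lets the rest of the task decide whether to retry or stop.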

The only downside of Jitbit is that it only works on Windows. For Linux/Windows/BSD, Actionaz looks to be the most promising.

You can find more alternatives here:

http://alternativeto.net/software/actionaz/

I tried most of them, and Jitbit was the best by far. Actionaz was the only cross-platform one that seemed to be nearly as capable as Jitbit, but I haven't tested it much yet (I use Jitbit).
 

Offline badon

Re: Capture NGC specimens
« Reply #3 on: 2013 Mar 10, 12:31:16 AM »
There is also an API in the CC that can be used to enter this data much more efficiently. I'm surprised I forgot to mention that. Maybe we can glue the CC together with your software via Pywikipediabot, and get the data entered as fast as the CC can save it.
 

Offline dc55232

Re: Capture NGC specimens
« Reply #4 on: 2013 Mar 11, 11:27:13 AM »
badon,

are you referring to the script "pagefromfile.py"? I don't know if it really fits. I am sure that I can generate XML files. So we could use import tools ...

Regards
dc55232
 

Offline badon

Re: Capture NGC specimens
« Reply #5 on: 2013 Mar 11, 11:20:47 PM »
No, none of those methods will work out of the box. We have to use a custom solution, and with the quantity of data involved each day, it's going to have to be fully automated. What it will end up doing is searching for pre-existing specimens, and if found, skip it or verify it's entered correctly. The same code could be used in a separate bot to verify the CC's entire specimen collection, detect changes worthy of human interest, flag errors and/or correct them, and cache verification data for optional human review.
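
A minimal sketch of that skip-or-verify decision, assuming each specimen is a flat record of field names and values. The function name `reconcile` and the action labels are hypothetical, and how the existing record is actually fetched from the CC is left out because the thread does not describe it.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]


def reconcile(incoming: Record, existing: Optional[Record]) -> Tuple[str, List[str]]:
    """Decide what to do with one captured specimen.

    Returns an action plus the list of fields that disagree:
      "create" - nothing on file yet, so the specimen can be entered
      "skip"   - already on file and every captured field matches
      "flag"   - already on file but some fields differ; needs review
    """
    if existing is None:
        return "create", []
    mismatched = sorted(
        field for field, value in incoming.items() if existing.get(field) != value
    )
    if mismatched:
        return "flag", mismatched
    return "skip", []
```

The same function could serve a separate verification bot: feed it the stored record and a freshly captured one, and collect everything that comes back as "flag" for human review.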

Sightings can be verified automatically like that too. Proceeding to increased levels of automation will greatly improve the accuracy and efficiency of our data entry team, and allow the CC to enter phase 2. Behind the scenes we've already been preparing a strategy for handling the "data explosion" you're initiating (a little earlier than planned). If we succeed in accommodating it, we can start hacking up the site's code to make it more user-friendly, and produce more useful data analysis. The data entry team can divert their attention to new, unexplored challenges.

I can make arrangements to enable automated entry of most sightings, which is currently the most challenging task the CC team does. Instead of entering everything manually, they will be able to simply verify automatically entered data, and focus on fixing errors and adding data that can't be entered automatically. That should allow us to move much closer to 100% market data coverage for our current CCT407 project.

It's very exciting to see the dream becoming a reality! Things are going to happen fast from here...the critical mass is forming. This is going to be stressful, and I think I'm going to spend a few more sleepless nights making sure this works like we've been hoping. Everything we need to succeed is already in place and battle-tested for the last year. Ready or not, here it comes!

Phase 2.....
 

Offline dc55232

Re: Capture NGC specimens
« Reply #6 on: 2013 Mar 17, 10:46:11 AM »
badon,

I am now close to 480000 specimens  :o

For example I captured 348 specimens of the 1989 Silver God of War & Wealth. Other samples indicate that I am now at about 90% coverage of all NGC graded specimens.

Regards
dc55232


« Last Edit: 2013 Mar 17, 10:48:18 AM by dc55232 »
 

Offline badon

Re: Capture NGC specimens
« Reply #7 on: 2013 Mar 17, 12:27:13 PM »
OK, are you able to share your data and/or your software, for everyone's benefit? If not, that's OK, we'll probably be able to duplicate it when I allocate some time for working on it. In that case, it would be helpful if you could describe what works and what doesn't, in how you achieved your research objectives. Of course, you don't have to do that either, if you prefer to work on your research independently for a while longer. I'm glad the CC has been able to help you with your research, and we will continue to help you as best we can.
 

Offline dc55232

Re: Capture NGC specimens
« Reply #8 on: 2013 Mar 18, 10:30:24 AM »
Hi badon,

I did all that to share the results. What we need in my opinion - in addition to a technical way to import the data - is a mapping of the NGC naming ("Label" plus "Denomination") to the CC types. This should be straightforward for types without varieties. If we have a one-to-one match of NGC varieties and CC varieties this works as well. For CC varieties without corresponding NGC varieties it gets more complicated ... Maybe it would be easier to discuss the next steps via chat ...
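
The three cases above (no varieties, one-to-one varieties, and CC varieties with no NGC counterpart) can be sketched as a lookup keyed on the NGC label and denomination. Everything here is an assumption for illustration: the function names, the status strings, and the placeholder type identifiers such as "TYPE-A" are hypothetical and do not come from the CC.

```python
from typing import Dict, List, Tuple

TypeKey = Tuple[str, str]  # (NGC label, NGC denomination), normalized


def normalize(text: str) -> str:
    """Upper-case and collapse whitespace so trivial differences still match."""
    return " ".join(text.upper().split())


def build_mapping(rows: List[Tuple[str, str, str]]) -> Dict[TypeKey, List[str]]:
    """Build the mapping from (label, denomination, cc_type) rows.

    One NGC name may list several CC types when the CC distinguishes
    varieties that NGC does not.
    """
    mapping: Dict[TypeKey, List[str]] = {}
    for label, denomination, cc_type in rows:
        key = (normalize(label), normalize(denomination))
        candidates = mapping.setdefault(key, [])
        if cc_type not in candidates:
            candidates.append(cc_type)
    return mapping


def classify(
    mapping: Dict[TypeKey, List[str]], label: str, denomination: str
) -> Tuple[str, List[str]]:
    """Return ("matched" | "needs_review" | "unmapped", candidate CC types)."""
    candidates = mapping.get((normalize(label), normalize(denomination)), [])
    if not candidates:
        return "unmapped", []
    if len(candidates) == 1:
        return "matched", candidates
    return "needs_review", candidates  # a human must pick the variety
```

Only the "needs_review" group would require someone to look at the images; the "matched" group could be entered without further checking.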

Besides that I started to capture sightings for NGC graded coins from a Chinese auction house.

Regards
dc55232
 

Offline badon

Re: Capture NGC specimens
« Reply #9 on: 2013 Mar 18, 01:17:17 PM »
Before we can do the mapping, I'd like to get a copy of the data so I can review its structure. I can add new features to the site to accommodate automated mass data-entry in a continuous process, but adding new features is sometimes time consuming to roll out to the entire site, so I need to plan it. For example, the mapping of type names can be easily implemented by making the "alternative names" a formal part of the CC's data structure. Then, the automated data entry bot will query the CC for the NGC type names, and if found, it will begin entering the specimens. It can continuously query the site for missing type names every few minutes, so as soon as someone adds the data to a type, the bot will eventually query it again and begin entering the data.

Of course, we can do a more expedient mapping in just a text file or a wiki page, to use temporarily while the bot waits for a successful query match as I'm rolling out the new features. The bots can use that same data to automatically add it to the CC's formal data structure, once the new features supporting it are rolled out.
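
A rough sketch of that wait-and-retry behaviour, under stated assumptions: `lookup` stands in for the query that asks the CC for an NGC type name (or reads the temporary text file or wiki page) and returns `None` when nothing is found, and `enter` stands in for whatever call saves a specimen. Both are hypothetical placeholders, as are the function names and the 300-second interval.

```python
import time
from typing import Callable, Dict, List, Optional

Pending = Dict[str, List[str]]  # NGC type name -> certification numbers waiting


def process_pending(
    pending: Pending,
    lookup: Callable[[str], Optional[str]],
    enter: Callable[[str, str], None],
) -> Pending:
    """Run one pass: enter specimens whose type is now known, keep the rest."""
    still_waiting: Pending = {}
    for ngc_name, cert_numbers in pending.items():
        cc_type = lookup(ngc_name)
        if cc_type is None:
            still_waiting[ngc_name] = cert_numbers  # try again on a later pass
            continue
        for cert_no in cert_numbers:
            enter(cc_type, cert_no)
    return still_waiting


def run(
    pending: Pending,
    lookup: Callable[[str], Optional[str]],
    enter: Callable[[str, str], None],
    interval_seconds: float = 300.0,
    max_passes: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Pending:
    """Repeat passes every few minutes until nothing is left or passes run out."""
    passes = 0
    while pending and (max_passes is None or passes < max_passes):
        pending = process_pending(pending, lookup, enter)
        passes += 1
        if pending and (max_passes is None or passes < max_passes):
            sleep(interval_seconds)
    return pending
```

Because the lookup is passed in, the bot can start with the temporary mapping and later switch to the formal data structure without any change to the loop.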

How big is your compressed data file? Can you share it somehow? Can you share the data collection software too?

Once I have at least some samples of your research, I can start creating a plan for incorporating it into the CC. Most importantly, I will need to run some tests on our new infrastructure to be sure we don't add the data too quickly or too slowly. The CC is now one of the largest sites in the world of its kind. I will have to check around, but I think the addition of your research could make it the largest. That probably means there are unforeseen technical problems that will need to be handled carefully to avoid service disruption for the site's current users.

There are a few upgrades we need to do, and if I notice a problem of some kind, it will affect my decision to do the upgrade now, or wait for later. There are also some redesigns that need to be done to enhance performance, but those can possibly be done while the data is being entered more slowly. I like to prioritize automated tasks first so they can proceed while I work on non-automated tasks simultaneously, for maximum efficiency in the use of my time.

Also, I'm hoping to someday be able to bring in 1 or 2 specialist CC staff members that might be able to reduce my technical task work load, so I can focus on other things. If you share your software, it will help me decide how to most efficiently use the small amount of funding we have, and the time of CC researchers like you that contribute back the results of the research they produced from the CC's data. What we really need soon is a steering committee that can observe what is being done, and put some brainpower into deciding how best to utilize the resources we have in order to achieve our public service mission, but that's another subject...

One thing I forgot to mention is that it may be possible to create a very simple variety finder system, where people are presented with images for a specimen of unknown type, and they compare those images with all of the known varieties. Then, they use a multiple choice selection to identify the variety type A, B, C, D, or "New", etc. That will make sorting specimens in huge numbers possible. Being so simple, I'm sure with a little evangelism, we can persuade many ordinary collectors to help us identify all of the specimens. Advanced researchers can also re-run the identification process to search for new, undiscovered varieties.

That was tamo42's brilliant idea, and I think he called it the "type investigator", "type finder", or "variety detective" or something like that. I can configure the system to run in an automated mode where a specimen gets evaluated multiple times, and if each investigator produces the same result, then the specimen can drop out of the investigation system, or perhaps be evaluated much less frequently. In time, we will have high confidence that every specimen has been examined by many experts, and every die has been identified, including fake dies.
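
The "drop out when every investigator agrees" rule can be sketched in a few lines. The function name `consensus` and the minimum of three votes are assumptions chosen for illustration; the thread does not say how many evaluations would be required.

```python
from typing import List, Optional


def consensus(votes: List[str], min_votes: int = 3) -> Optional[str]:
    """Return the agreed variety, or None if the specimen must stay in the queue.

    A specimen drops out only when enough investigators have voted
    and every one of them picked the same answer ("A", "B", "New", ...).
    """
    if len(votes) < min_votes:
        return None  # not evaluated often enough yet
    if len(set(votes)) == 1:
        return votes[0]  # unanimous
    return None  # disagreement: keep it in the investigation system
```

A specimen that returns `None` would simply be shown to another investigator, so disagreement costs more evaluations but never produces a wrong automatic assignment under this rule.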

There are more sophisticated things we can do with similarly simple systems like that, but I hope the technology and our infrastructure improves to handle the data querying burden that will create. Right now, with small improvements in our efficiency, I think we could free up enough funding to add 1 more staff member to do specimen sorting, if we have an efficient system like tamo42's type investigator idea. That way, we can use our most skilled experts for blazing new trails with new research that has never been done before, and leave the simpler tasks for less experienced researchers. Every time a major research project reaches a level of maturity that allows it to be broken down into simple components, we can create an opportunity for important research to be done by less skilled researchers.

New discoveries are rolling in quickly, thanks to the CC, but only because it makes new kinds of research possible for the first time in numismatic history. I think that trend will continue for several decades before CC researchers have fully explored the majority of interesting new ways to use the Coin Compendium's data.
 

Offline dc55232

Re: Capture NGC specimens
« Reply #10 on: 2013 Mar 18, 08:45:06 PM »
badon,

my results are split into data sets per year. I attached the smallest one, which is for 1979. The structure per line of the text file is:

<cert no>;<label>;<denomination>;<grade>;<link to obv image or empty if there is no image>;<link to rev image or empty if there is no image>;
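
A minimal sketch of reading one line in that structure. It assumes that no field contains a semicolon and that the trailing ";" shown in the structure is present on most lines; the class name `Specimen`, the field names, and the sample values in the usage note are hypothetical.

```python
from typing import NamedTuple, Optional


class Specimen(NamedTuple):
    cert_no: str
    label: str
    denomination: str
    grade: str
    obverse_url: Optional[str]  # None when there is no image
    reverse_url: Optional[str]  # None when there is no image


def parse_line(line: str) -> Specimen:
    """Parse '<cert no>;<label>;<denomination>;<grade>;<obv>;<rev>;'."""
    fields = line.rstrip("\r\n").split(";")
    # The trailing ";" produces a seventh, empty field; drop it.
    if len(fields) == 7 and fields[6] == "":
        fields.pop()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    cert_no, label, denomination, grade, obv, rev = (f.strip() for f in fields)
    return Specimen(cert_no, label, denomination, grade, obv or None, rev or None)
```

For example, `parse_line("0000000-001;SAMPLE LABEL;10Y;MS 69;;;")` gives a `Specimen` whose two image fields are `None`.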

Regards
dc55232
 

Offline badon

Re: Capture NGC specimens
« Reply #11 on: 2013 Mar 21, 03:49:14 AM »
OK, I will begin working on a plan for getting this format of data into the CC. If you can share your bot code, I can work on getting it integrated into the CC to keep adding new stuff automagically.
 

Offline badon

Re: Capture NGC specimens
« Reply #12 on: 2013 Apr 17, 11:59:56 PM »
Here is an update:

Our test results tell us that the CC isn't responsive enough to enter your specimen research data as quickly as we would like, because of all the data that needs to load to display a page after it is saved. So we're going to move all of that statistical information into dynamic pages that are generated only when a user requests them, and not on every page load. For example, instead of showing lists of sightings, maps of their locations, lists of images, etc., we'll only show counts of how many of each there are, and a link to a form for viewing more data. That will allow the wiki pages to be much more centered on article information, which seems to be getting added a lot more frequently lately, and it should also greatly reduce the CC's server load from all the traffic it gets. Once that is done, the CC can handle the increased load from entering many specimens as quickly as possible.

Another advantage of moving those information displays is that they can be customized and their features can be expanded, without disrupting the site. Right now, to make a tiny change would mean that many thousands of pages would have to be updated, and that process can take weeks sometimes. Many people have requested that the CC become more user-friendly, and I think we have at least the beginnings of the infrastructure we need to handle the increased traffic load that will bring.

Another HUGE advantage to moving data off of type, specimen, and sighting pages is that the data entry team will be able to work much more quickly and efficiently. That should free up some funding to help us weather through funding shortages, or we could just use the extra funding to enter more data - that's probably what we'll do, since we can always scale back operations if we run low on funding.

So, there are a lot of benefits to this new plan of action we have, and I'm going to implement it myself immediately. We were unable to hire a replacement for me, which is disappointing, but also saves us money because I work for free :) Instead we have already hired one more data entry team member, with others currently in training. The staff we already have have been permitted to work overtime to make the most of their skills and experience in curating CC data, so the new staff members can take over the simpler tasks and allow our experienced people to advance to more difficult tasks.

One clearly visible thing (literally) that we have been using CC curating team members for is adding to the Coin Compendium image collection. They are critical for the users of the CC, and it's not a task that is structured enough to be easily automated. So, I guess this actually fits very well with our plan to progress toward increasing automation. Our current team is gradually moving away from tasks that are easily automated so they can gain experience with more difficult tasks that will eventually be a larger fraction of their responsibilities.

This CC forum is running on the same archival grade data storage system that the Coin Compendium runs on, so you can upload your data to this topic for safe keeping until we're ready to begin entering it. It is backed up frequently, and its data integrity is checked on our "live" system every time it is accessed, and any errors are automatically detected and corrected from one of our many duplicate copies we keep in the "live" system. Instead of being catastrophic, like in "normal" systems, error detection and correction is routine for us, and we don't even need to think about it anymore.

I'm not sure what our net MTBF is, but once something is written into storage without errors, I would expect it to be hundreds or thousands of years or more before a random error could occur in the same place in all of our live data copies simultaneously. And, in that case, we still have more copies in offsite backups. I'm sure by the time we approach our MTBF a few centuries from now, we will have increased the number of copies we keep in our live system to extend the MTBF for a few more centuries :)

We sacrifice performance by doing things this way, but if we can't preserve the work we're doing, there's no point in doing it. So, we must have data integrity as our #1 priority, and that has worked out well. There are always many other things that can possibly go wrong, but if our server farm gets nuked, or the Sun burps and blasts away Earth's atmosphere and fries our electronics, we will have bigger problems to worry about :)
 

Offline dc55232

Re: Capture NGC specimens
« Reply #13 on: 2013 Apr 21, 03:48:49 AM »
badon,

thanks for the heads up! I really like the idea of moving the "dynamic data" to subpages. I think it will enhance user experience a lot ...

Currently I do have about 493000 specimens from NGC, about 100000 from PCGS, and thousands of sightings from zhaoonline related to specific NGC and PCGS numbers. Let alone all the additional pre-1979 data ...

I am still trying to figure out some more opportunities for automation, e.g. batch barcode recognition. I also spent some time with image comparison tools and libraries to automatically identify varieties but this requires some more investigation.

Regards
dc55232
 

Offline badon

Re: Capture NGC specimens
« Reply #14 on: 2013 Apr 21, 08:29:54 PM »
I doubt you'll have much success with machine vision on identifying varieties. Most of that tech requires consistent photography, with variables like lighting and lens type factored out. You might be able to get it to work for a few very specific varieties, so it's not impossible, but I would be very impressed if you succeeded with that much.

The batch barcode recognition is a much simpler problem, and the best barcode recognition I've tested is this one:

http://online-barcode-reader.inliteresearch.com/

Unfortunately, they designed their software to not be able to function on anything but Windows, so it's limited in how we could use it, and it's more expensive. But, it's the best I know of. It's the one that the CC team uses most of the time when they can't visually read the certification numbers.

The maximum achievable automation we can do is to integrate real humans into a semi-automated work flow. For example, the automated systems can present 2 images to a human, and ask him to indicate if they are the same, or different. Amazon's Mechanical Turk (mturk) is ideal for that kind of thing, and since it's so structured, it's very easy for non-experts to handle. We could even divide up images into quadrants to further reduce the complexity of the task.

We're not currently doing anything like that, but if we set up for it, I could probably get us some funding to keep a semi-automated system staffed with real humans to pick up the slack that the automated parts can't do.

Last I heard, I think NGC has about 26 million specimens. I don't know how many PCGS has. We only need 1 or 2 million specimens to make the CC the largest MediaWiki site in the world. Wikipedia is currently #1.

By the way, are you running bots on the CC data? We've had some traffic spikes lately that have made the site temporarily unusable. I had to block access from bots a few times, and today, I had to block all unregistered users to keep the site functioning. We get huge numbers of bots on the CC, something like 25,000 of them each week. We have special systems in place to block those bots when they're causing problems for real users, but today they seemed to have been able to get past the blocks often enough to still cause a problem.

I'm already aware you're pulling down new data regularly from the CC, but that probably isn't enough to matter. I think you might be paging through the data so you can get more than 500 records at a time. If you're doing that without any delays between retrievals, that could be part of the problem.
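
A sketch of paging with a pause between retrievals, assuming the data arrives one page at a time with some kind of continuation token. The `fetch_page` callback is a hypothetical placeholder for the real request, and the 2-second default delay is an arbitrary example value, not a limit the CC has stated.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Tuple[List[Dict], Optional[str]]  # (records, token for the next page or None)


def fetch_all(
    fetch_page: Callable[[Optional[str]], Page],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Dict]:
    """Yield every record, pausing between page requests to spare the server."""
    token: Optional[str] = None
    while True:
        records, token = fetch_page(token)
        yield from records
        if token is None:
            return  # last page reached, no need to wait
        sleep(delay_seconds)  # be polite before asking for the next page
```

Spacing the requests out makes the total download slower, but it keeps a long paging run from looking like a traffic spike.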
 

Offline badon

Re: Capture NGC specimens
« Reply #15 on: 2013 Aug 29, 03:26:32 PM »
A couple of updates for this discussion:

I have noticed that NGC is blocking repeated specimen verifications sometimes, if I try to verify a few of them very quickly. They're probably trying to reduce the load on their servers, and slow the rate people use their verification service. Has that affected your research on certification numbers?

Quote from dc55232 (Reply #13): I really like the idea of moving the "dynamic data" to subpages. I think it will enhance user experience a lot ...

I forgot to post an update here when moving the dynamic data got finished. There is now a link on each page that takes you to essentially the exact same data that used to be on every page. However, it's MUCH easier to update the research tools we put on those data pages, because they do not need to be gradually rolled out to the whole site and all of its thousands of pages when a small update is made. All updates are now instantaneous!

That change has greatly reduced the load on the CC's servers, so our data entry team is much more efficient because they don't need to wait for all the data to be queried on each page load. Our automated bots are able to do tasks more quickly without noticeably affecting the site's performance, which is another bonus.
 

Offline DRAGON BREATH

Re: Capture NGC specimens
« Reply #16 on: 2017 Oct 03, 01:29:34 AM »
I see lots of info about the bubble even goldfish here. Whenever I see them on feebay, I see silver plated coins and then others claiming to be silver. Are there 2 separate series for the 1990 goldfish?
 

Offline Jeru

Re: Capture NGC specimens
« Reply #17 on: 2017 Oct 03, 03:58:29 AM »
Quote from DRAGON BREATH (Reply #16): I see lots of info about the bubble even goldfish here. Whenever I see them on feebay, I see silver plated coins and then others claiming to be silver. Are there 2 separate series for the 1990 goldfish?

CCT755: Goldfish

You can check that page, DRAGON BREATH.
"...but the greatest of these is love." -1 Corinthians 13:13
 

Offline badon

Re: Capture NGC specimens
« Reply #18 on: 2017 Oct 06, 08:37:53 PM »
There are 3 silver goldfish sets. 1984 silver with 1 fish on the reverse, 1984 silver plated brass with 1 fish on the reverse, and 1990 silver with 2 fish on the reverse.
 

Offline ZROGST

Re: Capture NGC specimens
« Reply #19 on: 2017 Nov 01, 05:38:48 PM »
Hello there o/ I have to admit that I am not 100% following the thread, as I do not write scripts or work with data sets. I am looking to compile a sort of design-oriented introduction to modern non-fiats though, and images will be absolutely necessary. Ultimately, if enough specimens have NGC Internet Imaging data, that would eliminate a massive amount of work (access to free high quality shots using the same professional equipment and lighting? Count me in). Would you be able to tell me if this data exists? I'm not sure how many submitters elect to use this service or if NGC Photo Vision data is mineable. Cheers.
 

Offline badon

Re: Capture NGC specimens
« Reply #20 on: 2017 Nov 02, 01:02:07 AM »
The CC has many tens of thousands of images in its catalog. NGC has images too, but they're very low quality. The way I am using the CC's image data to publish for people who don't have access to it is to screenshot a table of all the types in the series I'm interested in. For example, here's all the types in CCT6990: Panda coin collection expo: